PREFACE


THE present century has seen extensive developments in methods of 
measuring scientifically the traits and abilities of human beings. 
Three main influences may be discovered behind this movement. 
First, there is the tremendous interest which people have always 
shown in judging one another’s qualities. Much of their conversation 
consists in summing up their acquaintances. Their plans, either for 
work or play, are very largely guided by their notions regarding the 
persons with whom they are about to deal. It is obvious that such 
notions are frequently biased or inaccurate, and yet each one con- 
tinues to rely, in great affairs or small, chiefly on his own insight into 
the characters of others. However, it is recognized that in certain 
spheres of human activity, more accurate and impartial assessments 
are essential. The examination system, now firmly established in 
schools, universities, and in many professions, is supposed to meas- 
ure the achievements and capacities of pupils, of students, and of 
candidates for various jobs. Although this system is certainly an 
advance on the earlier system of personal judgments, yet much 
doubt exists as to its efficiency. A considerable section of this book 
is therefore devoted to a discussion of the defects of examinations, 
and of present-day methods of marking pupils’ or students’ work, 
and to a study of how these may be improved.

Secondly, mankind is interested, not only in individuals, but also 
in general problems of human nature and human institutions. Thus 
another field where accurate measures of human qualities are needed 
is the scientific investigation of mind, behaviour, and society. Loose 
thinking and prejudice are extremely prevalent in matters of politics, 
economics, religion, crime, and the like—as is shown, for example, 
by Thouless in his book Straight and Crooked Thinking (1930). Yet 
the social sciences, such as psychology and sociology, are endeavour- 
ing to study these subjects rationally and impartially, in the same 
manner as the physicist studies the atom, or the physiologist the 
workings of the body. Now the data from which a science is built 
up are very largely numerical and quantitative. The records of a 
chemical or physical experiment usually consist of measures of 
weights, times, distances, galvanometer or thermometer readings. 
Measurement is almost equally important for purposes of precise
description and experiment in the social sciences; and the deduction 
of sound conclusions from such quantitative data frequently involves 
the application of statistical treatment. In this connection, then, we 
shall consider below the nature of measurement in psychology, and 
the elements of statistical technique, with especial reference to the 
field of education. 

A third source of the mental testing movement lies in the desire
for human betterment. Medical science has advanced enormously in 
its ability to diagnose and treat bodily ills, but psychiatry and psy- 
chology are only beginning to learn how to handle mental abnor- 
malities. Yet there are already, in addition to the mental hospitals, 
numerous clinics where the diagnosis and treatment of emotionally 
maladjusted patients—adults and children—are carried out. Special 
schools, and tutorial classes in some ordinary schools, play their 
part by providing a type of education suited to children who are too 
greatly handicapped, mentally or physically, to make progress under 
ordinary school conditions. Industrial schools and the Borstal system 
aim at re-educating delinquents and budding criminals, so as to turn 
them into useful members of society. The staffs of all these institu- 
tions need accurate methods of measuring the educational and 
occupational capacities of the ‘patients.’ Actually many of our most
valuable mental tests were first constructed by workers in the fields 
of educational, emotional, and vocational guidance. 

This book is thus intended primarily for three classes of students, 
namely teachers and examiners, psychologists and others who set 
out to investigate human phenomena from the scientific standpoint, 
and doctors and psychologists who wish to make practical use of
tests in clinics and schools. For reasons of space, no attempt is made
to describe the various mental tests in detail, but a classified list of
those used in this country is included in Chap. IX. Many authors
have emphasized the value of tests without sufficiently drawing
attention to their dangers and difficulties. Thus the main object of
this book is to outline the principles of test construction and
application or procedure, and the interpretation of the results of tests
and examinations, ignorance of which on the part of testers has
frequently led to serious errors. Mental measurement undoubtedly needs
a high degree of skill, which few of the persons just mentioned have
either the time or the opportunity to acquire thoroughly, before they
set out to test or examine.

There are also available many excellent textbooks on statistical
methods, perhaps the most useful to testers being those of Garrett 
(1953), Dawson (1933), and Guilford (1936). Yet in the present 
writer's experience, all of these are too difficult for the majority of 
students whose psychological training only amounts to between 
seventy-five and one hundred and fifty hours. This book aims, 
therefore, to describe the minimum essentials, to show their immediate
application to psychological (and especially to educational)
problems, and to stress their general significance rather than their 
technical details. 

More advanced students are unlikely to find within much that is 
original. But, so far as the writer knows, no previous account of the 
statistics of marking, nor of the construction of new-type examinations,
has been published by a British author. The analysis of educational
measurement and the discussion of the technique of examining
are partly new but owe a great deal to Hamilton’s brilliant book, 
The Art of Interrogation (1929). Possibly also some of the practical 
hints about tests and testing have not been published before. 

Lack of space has necessitated some important omissions, such as 
the whole field of temperament and character assessment, the testing 
of special (vocational) aptitudes, and of sensory-motor capacities. 
Actually few of the tests falling under these headings are well stan- 
dardized, and fewer still can safely be applied by those who are not 
trained psychologists. 

PARIN: 


THE UNIVERSITY
GLASGOW
July 1939


PREFACE TO THE SECOND EDITION 


THE general principles of mental testing and elementary statistical 
analysis have changed but little in the sixteen years since this book 
was written. But the educational context in which they were origin- 
ally set has altered considerably, notably with the passing of the 1944 
Education Act. This edition, then, brings the book up to date in 
respect of selection for secondary education, though it is still, of
course, concerned only with the critical study of methods—not with
the selection problem as a whole. Next, there have been important
developments in the field of intelligence theory and intelligence
testing, and it is hoped that the views put forward here represent a
more reasonable picture of the relations of innate and acquired 
abilities and of the findings of factorial analysis than has hitherto 
been available to education and psychology students. The compre- 
hensive list of published mental tests has been completely recon- 
structed. Several of the statistical sections have been revised and 
slightly expanded to cover more of the problems which the examiner, 
mental tester, or research worker is likely to meet, while avoiding
as far as possible any increase in complexity. Thus in general the 
book covers the same ground as the original edition, at much the 
same length, but in a more up-to-date and more accurate manner. 

CHAPTER I


INTRODUCTION TO STATISTICAL METHODS 


IT is an unfortunate, yet indubitable, fact that the majority of
students of psychology and education find extreme difficulty in 
grasping and applying the statistical concepts which are an essential 
prerequisite to sound mental measurement. They take fright at the 
mere mention of statistics or at the sight of statistical symbols and 
formulae. A small minority, on the other hand, find in the
manipulations of numbers which we call statistics an intense fascination,
closely akin to the musical composer's fascination with interweavings 
of sounds. Probably statistical ability is as specialized and as rare as 
is ability in musical composition, and the expert statistician rather 
easily loses touch with the doubts and difficulties of the beginner. 
He fails to realize that the statistical way of looking at educational 
or psychological phenomena, which comes so naturally to him, is 
at first entirely foreign to the average student. Yet the student’s 
‘phobia’ is certainly unnecessary, since the elementary statistics 
needed for the proper treatment of marks and test scores includes 
little more than the kind of algebra and arithmetic which he mastered 
by the age of 14. Statistics means nothing more than numerical data,
such as measurements, examination and test results; and statistical
methods are merely the ways of dealing with, or organizing, these
data so as to bring out their full significance. 

The raison d’être of statistical methods is that human beings are
apt to differ so widely from one another in any and every measurable
characteristic, that results or conclusions which apply to one person
or one group of persons (e.g. one school class), do not necessarily apply
to another person or group. First, then, we must have efficient
techniques of arranging or tabulating the scores and marks obtained from
different people, in order to see what is the general run of their scores,
and what is the range or extent of the differences. These techniques are
described in the present chapter. The ‘general run’ and the ‘range or
extent’ of scores next require more precise definition and measurement;
that is to say, we must know how to work out averages and indices
of what is called spread, scatter, dispersion, or variability (Chap. III).

Not only do human beings differ from one another, but also each
individual varies in his capacities and characteristics from time to
time. The score he achieves to-day may or may not be typical of his
usual accomplishments. The same holds true of a class of pupils or
students. The extent of this untrustworthiness or unreliability (as it
is generally termed) of a single measurement or of a set of
measurements therefore needs to be determined. Statistical treatment tells
us whether any alteration or variation in scores represents an
important change—such a change, for example, as we might hope to
bring about by means of the teaching given to the individual or
the class—or whether it is merely a matter of chance fluctuations.
A further very common and important problem is the
significance of a difference between two groups of persons. For
instance, girls may appear to be better than boys at [...], yet
this is not true of all girls and all boys; so we must have a method of
establishing how general and how reliable is the difference (Chap. [...]).

Still another type of variation requires special techniques of
investigation, namely, the variations in people's capacities in different
spheres. It is obvious that no one is even all-round in his level
of abilities. Yet we are apt to assume that high achievement in one
field implies high achievement in other fields. Sometimes also we infer
the opposite type of relationship—that if a person is strong in one
trait, then he will be lacking in some other trait. The statistical
methods of correlation enable us to say how far various traits or
abilities ‘go together’ or are related to one another. They also tell us
how far our tests or examinations measure what they are supposed
to measure, i.e. how valid they are. For example, a Certificate or
Matriculation examination at the age of 16-18 may, or may not,
validly predict success in a subsequent university career (Chaps.
VI-VIII).

Although statistical methods are, then, of undoubted importance,
their limitations should also be recognized at this stage in our
discussion. They are tools for handling numerical data, but they are
not capable of improving data which may be initially inaccurate or
biased. When an investigator uses a poor test or examination, or
when he observes and records people's behaviour or mental processes
wrongly, no amount of statistical treatment of his results can correct
such deficiencies. Further, the fundamentally abstract nature of
statistical methods needs to be stressed. They are adapted primarily for
mass-investigations of measurements obtained from large numbers
of persons, and such investigations are apt to lose touch with the
concrete individual case. Without them we could not make sound 
generalizations about human beings, but even with them we cannot 
usually state whether a generalization applies to a particular human 
being. The practical handling and understanding of an actual child 
or adult is still an art rather than a science, though it may be greatly 
assisted by scientific mental tests and guided by the results of objec- 
tive psychological experiments and statistical studies. 


TABULATION OF MEASURES 

Measures and Variables.—By a variable we shall mean an ability 
or some other characteristic in respect of which human beings vary 
or differ from one another. Height, age, income, intelligence, 
examination success, are all variables in this sense. A measure is the 
numerical expression of a certain standing on, or amount of, a 
variable. Fifty inches of height, five pounds a week income, and a 
mark of 6 out of 10 are examples of measures. 

Most variables are continuous; that is to say, they can vary by 
infinitely small gradations. But we only measure them to the nearest 
convenient unit. For example, heights may be measured to the nearest
tenth of an inch. But actually 50.1 inches includes all heights from
50.05 to 50.15; 50.2 inches includes from 50.15 to 50.25, and so on.
Similarly, a school mark of 6 is usually an approximation representing
all degrees of quality from 5½ to 6½. If, however, half-marks are
awarded, then 6 represents 5¾ to 6¼, and 6½ represents 6¼ to 6¾, and
so on. But some variables are discrete or discontinuous, since they
only vary by units. For instance, if the size of a class of pupils is 40,
this does not imply between 39½ and 40½ pupils. The scores for many
tests, and some school marks, are of this type, for they represent the 
exact numbers of questions or test items correctly answered. Thus a 
pupil would score 6 out of 10 for arithmetic sums even if he had 
nearly completed a seventh sum. (Such measures are referred to in 
the next chapter as count or enumeration scores.) 
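
The limits that a recorded continuous measure stands for can be worked out mechanically. The following is a minimal sketch in Python, not part of the original text; the function name `real_limits` is hypothetical, and exact fractions are used so that the half-units come out without rounding error.

```python
from fractions import Fraction

def real_limits(measure, unit):
    """Return the (lower, upper) limits that a continuous measure stands for
    when it has been recorded to the nearest `unit`."""
    measure, unit = Fraction(measure), Fraction(unit)
    half = unit / 2
    return measure - half, measure + half

# Height recorded to the nearest tenth of an inch.
low, high = real_limits("50.1", "0.1")
print(float(low), float(high))   # 50.05 50.15

# A whole-number school mark, then a mark on a half-mark scale.
print(real_limits(6, 1))               # limits 5 1/2 and 6 1/2
print(real_limits(6, Fraction(1, 2)))  # limits 5 3/4 and 6 1/4
```

A discrete count, such as a class size of 40, has no such limits, so the function should not be applied to it.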

This distinction between continuous and discrete variables needs 
to be borne in mind when tabulating measures (cf. p. 5). 

Tabulation.—Tabulation means arranging a set of measures in an 
orderly fashion, so as to convey the essential facts about them in 
convenient and relatively brief form. If we take a class of pupils, and 
beside each pupil's name write his age, or exam. mark, we get a list, 
but it is not a table, since the measures are quite haphazardly


4 THE MEASUREMENT OF ABILITIES 


arranged. It is very difficult to see from such a list, if it be a long one,
what is the general run and range of measures, or whether any
particular measure (e.g. John Smith’s 6 out of 10) is high, low, or
average.

One way of tabulating is to rearrange all the measures
in order of size from highest to lowest; that is, to put them in rank
order. The desired information can then be grasped much more
readily. But this is a laborious process if the number of pupils is
large.

Frequency Distributions.—Under such circumstances it is better to
write in a column each possible measure, in order, and beside it the
frequency, F; that is, the number of persons who obtain that
measure. The result is called a frequency distribution, since it shows
how the various measures are allotted or distributed among the
persons.


Ex. 1.—The following is a list of the marks (out of 20) which
were awarded to a class of 50 pupils. The table below shows their
frequency distribution. Marks: 10, 20, 17, 19, 14, 9, 11, 17, 18, 8, 1,
20, 15, 18, 17, 13, 11, 4, 16, 13, [...], 7, 5, 17, 15, 13, 4, 6, 9, [...],
9, 12, 19, 16, 15, 17, 14, 13, [...], 8, 14, 18, 15, 12, 17, 16, 15, 17, 2.

Measures   Tallies   F
   20      //        2
   19      //        2
   18      [...]     [...]
   17      [...]     [...]
   16      [...]     [...]
   15      [...]     [...]
   14      ///       3
   13      ////      4
   12      //        2
   11      //        2
   10      //        2
    9      ///       3
    8      //        2
    7      /         1
    6      [...]     [...]
    5      [...]     [...]
    4      //        2
    3                0
    2      /         1
    1      /         1

                 N = 50

Hints on Drawing Up a Frequency Table.—Do not look through
the original list and pick out all who got the highest measure, e.g. 20,
then all who got the next highest, 19, and so on. This would be a
waste of time and would easily lead to mistakes. Instead go through
the measures in the list consecutively, and as each one is reached, put
a tally or check mark opposite the appropriate measure in the
column. When any one measure recurs for the fifth time, draw a line
through the first four tallies. Repeat this on reaching the tenth,
fifteenth, etc., tally. Add up the tallies to obtain the F column. Add
up the F column itself to make sure that it checks with N, the total
number of persons.
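
The tally procedure just described can be sketched in a few lines of Python. This is an illustration only: the function name `frequency_distribution` and the short list of marks are hypothetical, not taken from Ex. 1. The list is read once from start to finish, one tally is added per measure, and the F column is checked against N at the end.

```python
from collections import Counter

def frequency_distribution(measures, highest, lowest):
    """Tally each measure in a single pass and return (measure, F) rows,
    highest measure first, including measures that nobody obtained."""
    tallies = Counter()
    for m in measures:          # go through the list consecutively
        tallies[m] += 1         # one tally mark per occurrence
    rows = [(m, tallies[m]) for m in range(highest, lowest - 1, -1)]
    # Check: the F column must add up to N, the number of persons.
    assert sum(f for _, f in rows) == len(measures), "F column does not sum to N"
    return rows

marks = [6, 8, 5, 6, 9, 6, 7, 5, 8, 6, 10, 7]   # hypothetical marks out of 10
for measure, f in frequency_distribution(marks, highest=10, lowest=4):
    print(f"{measure:>2} {'/' * f:<5} {f}")
```

The final check fails if any measure lies outside the column of possible measures, which is the same slip that the hand check against N is meant to catch.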

Distributions of Classified Measures.—The above procedure is still
likely to produce an over-lengthy table, especially when N is large.
It would be absurd, for example, to tabulate the heights of all the
pupils of a school in this manner. Instead of listing the frequencies of
children 40.0, 40.1, 40.2 . . . inches tall, we would give the totals
with heights of 40, 41, 42 . . . inches, or the totals with heights of
40 upwards, 42 upwards, 44 upwards, and so on. That is, we would
group together sets of contiguous measures under a series of classes
or categories.

Considerable care is needed in deciding upon what is, or is not,
included in a class. When the variable is discrete—for example, test
scores ranging from 10 to 93—we may call our classes 10-14, 15-19,
20-24, etc., or else 10 +, 15 +, 20 +, etc. Both of these indicate
clearly that all scores of 10, 11, 12, 13, and 14 are included in the
first class, and that the size of each class is five units. The same
methods may be used for school marks. For instance, 10%-14%,
15%-19% . . . is unambiguous, although, strictly speaking, the
first class includes all shades of merit from 9½% to 14½%. But with
some continuous variables such as age or height, difficulties often
arise because the original measures are themselves only approximations
which represent classes of measures (cf. p. 3). Thus if height
has been measured to the nearest tenth of an inch, and is tabulated
to the nearest inch, we should avoid calling our classes 40, 41, 42
. . . For 40 might include 39.95 to 40.95, or 39.45 to 40.45, or
39.55 to 40.55. If we write 40 +, 41 +, etc., the first of these is usually
intended. With ages, 5 +, 6 + . . . is quite clear, since the 5 +
class includes all children who are at, or who have passed, their fifth
birthday, but have not yet reached their sixth.

No set rules can be laid down as to the choice of classes, but the
following additional hints will apply to most of the tables which
teachers or psychologists will need.

(1) The number of different classes is generally somewhere between 
6 and 15. Some tables contain more, few less. When the total number 
of persons, N, is only 50 or less, a small number of classes is generally 
sufficient; but if N is nearer 500, a more detailed table with a larger 
number of classes may be desirable.

(2) Pick out from the original list of measures the highest and the
lowest measure to be tabulated, and deduce from them the total
range of the distribution. If the range is from 20 to 1, then 7 classes
containing 3 different measures each are likely to be appropriate. If 
the range is from 137 to 65, then 15 classes of 5 are a possibility. 
In most instances all the classes should contain the same number of 
different measures, i.e. they should be of the same size. 

(3) It is desirable, though not essential, to include an odd number 
of different measures in each class. The reason for this is that in any 
subsequent calculations we are going to regard all the measures 
grouped within any one class as lying at the midpoint of that class. 
If the classes include, say, 3-7, 8-12, 13-17, 18-22, etc., then we
shall count all the 3’s, 4’s, 5’s, 6’s, and 7’s as 5’s, and all from 8 to 12
as 10’s. The midpoint of a class is found by taking the average of the
highest and lowest measures it can contain, e.g. (3 + 7) ÷ 2 = 5.
The same figure is the midpoint of a class whose limits are 2½
to 7½. But a class such as 39.95 to 40.95 has the rather inconvenient
midpoint of 40.45.

(4) With school marks it is advisable to arrange for any round
numbers, such as the 5’s and 10’s, to fall at the centres of the classes.
Teachers are very apt to award these round numbers and to neglect
intermediate ones. Hence, if classes of 10-14, 15-19, 20-24 . . . are
chosen, most of the frequencies may actually lie at 10, 15, 20 . . .,
and it would be illegitimate to regard them as lying at the midpoints
12, 17, 22 . . . The difficulty will be surmounted if classes of
8-12, 13-17, 18-22 . . . are used instead.

Having chosen the classes and written them out in a column,
enter the measures by the ‘tally’ method, as before, and check the
total, N.
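
The grouping of measures into equal classes, with the midpoint of each class, can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the function name `grouped_distribution` and the marks are hypothetical, and the classes are chosen as in hint (4) so that the round numbers 10, 15, and 20 fall at the class centres.

```python
def grouped_distribution(measures, lowest_limit, size):
    """Group whole-number measures into equal classes of `size` units,
    starting at `lowest_limit`. Returns rows of (low, high, midpoint, F),
    highest class first."""
    top = max(measures)
    rows = []
    low = lowest_limit
    while low <= top:
        high = low + size - 1
        f = sum(1 for m in measures if low <= m <= high)
        # Midpoint: average of the lowest and highest measures in the class.
        rows.append((low, high, (low + high) / 2, f))
        low += size
    assert sum(r[3] for r in rows) == len(measures), "a measure fell below the lowest class"
    return rows[::-1]

marks = [12, 15, 20, 9, 15, 10, 17, 20, 14, 15, 10, 22, 8, 15, 20]  # hypothetical
for low, high, mid, f in grouped_distribution(marks, lowest_limit=8, size=5):
    print(f"{low:>2}-{high:<2}  midpoint {mid:>4}  F = {f}")
```

With classes of 8-12, 13-17, and 18-22 the midpoints are 10, 15, and 20, so treating every measure in a class as lying at its midpoint does little violence to marks that cluster on round numbers.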

Ex. 2.—Below is the distribution of the measures already listed
in Ex. 1. They have been grouped into classes of 3 marks each.

Classes    F
 18-20     [...]
 15-17     [...]
 12-14     9
  9-11     7
  6-8      [...]
  3-5      [...]
  0-2      2

       N = 50

GRAPHICAL REPRESENTATION OF DISTRIBUTIONS

A distribution may often be portrayed more clearly by means of
a graph, either of the type known as a histogram, or by a frequency
polygon.¹ In both of these the frequencies are plotted on the vertical
(Y) axis against the measures, or classes of measures, on the
horizontal (X) axis. The scale of the graph and the location of the origin
(the point of intersection of the axes) are purely matters of convenience.
Suppose, for example, that our graph paper measures 6 inches × [...]
inches, each inch being divided into tenths, and that we wish to plot
the distribution given in Ex. 2. We might use a scale of 3 persons to
the inch on the Y axis, and one class (3 marks) to the inch on the X
axis. This has been done in Figs. 1-3. Fig. 4 portrays two distributions
of intelligence test scores which have been grouped into classes
73-77, 78-82 . . . 133-137. Each inch on the Y axis represents 5
persons, and each inch on the X axis includes 2 classes (10 points)
of score.

Fig. 1.—Histogram of distribution from Ex. 2.

Fig. 2.—Incomplete frequency polygon of distribution from Ex. 2.

¹ The ogive type of graph, or cumulative frequency curve, will be omitted
here. The reader may find accounts of its construction and uses in more advanced
textbooks, such as Garrett (1953) or Dawson (1933).

A completed [...] horizontal dime[...]

The origin is usually at X = 0, but there is no need for this.
Thus in Figs. 2-3 it is at X = −2, in order that the Y axis may
be drawn at the left-hand edge of the graph paper. And in Fig. 4 it
is at X = 100, in order that the Y axis may be in the centre of the
page.

The Histogram or Column Diagram.—Fig. 1 is a histogram portraying
the distribution given in Ex. 2. It consists of a series of pillars
or columns drawn side by side. Each pillar represents one measure
or, as in this instance, one class of measures, and its height represents
the frequency of this measure or class. The marks from 0 to 20 are
written along the X axis, but note that each mark is represented by a
space of ⅓ of an inch on this axis, not by one of the inch or
tenth-inch lines. The first class (including 0, 1, and 2 marks) is, therefore,
represented by the whole width of the first inch, the second class (3,
4, and 5 marks) by the whole width of the second inch, and so on.

Fig. 3.—Completed frequency polygon of distribution from Ex. 2.

The Frequency Polygon.—All the measures included in a class are
here represented by a point on the graph, whose Y co-ordinate is the
frequency, and whose X co-ordinate is the midpoint of the class.
Thus, in Fig. 2, which represents the same distribution as before, the
midpoints are written along the X axis, and this time they are located
either at an inch or tenth-inch line. The points on the graph are
connected up by straight lines. Note that it is incorrect to smooth
the lines into a curve, like those in Figs. 10-14. The figure is, as it
were, suspended in mid-air, hence it is usual to complete it (Fig. 3)
by connecting the first and last points with additional points whose
Y co-ordinates are zero. The X co-ordinates of these additional
points are the midpoints of the classes just below the lowest, and just
above the highest classes, in the distribution, i.e. X = −2 and
X = 22.
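
The co-ordinates of a completed frequency polygon follow directly from the class table. The sketch below is illustrative only: the function name `polygon_points` is hypothetical and the frequencies are invented, but the classes (0-2, 3-5, up to 18-20) are those of Ex. 2, so the two added zero points come out at the X values given in the text.

```python
def polygon_points(class_rows, size):
    """class_rows: (lowest measure in class, F) pairs in ascending order,
    for contiguous classes of equal `size`. Returns the (X, Y) points of the
    completed frequency polygon: class midpoints, plus a zero-frequency
    point one class below the lowest and one class above the highest."""
    mids = [(low + (size - 1) / 2, f) for low, f in class_rows]
    first_x, last_x = mids[0][0], mids[-1][0]
    return [(first_x - size, 0)] + mids + [(last_x + size, 0)]

# Hypothetical frequencies for classes 0-2, 3-5, ..., 18-20 (3 marks each).
rows = [(0, 1), (3, 2), (6, 4), (9, 6), (12, 5), (15, 3), (18, 1)]
points = polygon_points(rows, size=3)
print(points[0], points[-1])   # (-2.0, 0) (22.0, 0)
```

Joining these points in order with straight lines, and not with a smoothed curve, gives the polygon.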


Ex. 3.—For further illustration there follow the frequency
distributions of scores obtained by 136 men and 114 women students on
the Cattell Intelligence Test, Scale IIIA. The scores are grouped into
classes of 5. The distributions for the two sexes are plotted as
frequency polygons on the same graph (Fig. 4), and for the sexes
combined as a histogram (Fig. 5). Note in the latter that the scale
along the Y axis has been halved, since combining the sexes roughly
doubles the frequencies to be plotted.

                      Frequencies
Test Scores    Men    Women    Combined
  133 +          1       0         1
  128 +          0       1         1
  123 +          5       1         6
  118 +          8       3        11
  113 +         13       9        22
  108 +         17       9        26
  103 +         21      14        35
   98 +         17      30        47
   93 +         23      20        43
   88 +         14      13        27
   83 +          6      12        18
   78 +          8       2        10
   73 +          3       0         3

  N =          136     114       250
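
A table of this kind can be checked by adding across the rows and down the columns. The following Python sketch does so for Ex. 3. It assumes that the men's frequency in the 98 + class is 17, which is the value required both by that row's combined figure of 47 and by the column total of 136.

```python
# (lowest score in class, men, women, combined), as tabulated in Ex. 3
rows = [
    (133, 1, 0, 1), (128, 0, 1, 1), (123, 5, 1, 6), (118, 8, 3, 11),
    (113, 13, 9, 22), (108, 17, 9, 26), (103, 21, 14, 35), (98, 17, 30, 47),
    (93, 23, 20, 43), (88, 14, 13, 27), (83, 6, 12, 18), (78, 8, 2, 10),
    (73, 3, 0, 3),
]
# Each row must add across, and each column must add down to its N.
for low, men, women, combined in rows:
    assert men + women == combined, f"row {low}+ does not add across"
totals = tuple(sum(r[i] for r in rows) for i in (1, 2, 3))
print(totals)   # (136, 114, 250)
```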


Relative Merits of Frequency Polygons and Histograms.—Frequency
polygons possess the advantage over histograms that two or
more distributions can be superimposed and compared on one graph,
as in Fig. 4. But they are inferior in that the lines connecting
successive points give a somewhat inaccurate notion of the distribution,
especially when the number of points is small. For instance, in Fig. 4
the connecting line between the two points X = 110, F = 17 and
X = 115, F = 13 suggests that 15 men students obtained intermediate
scores of 112½, which is quite untrue. In this respect the
histogram is unexceptionable, since it gives an exact portrayal of the
tabulated data. The total area enclosed within a histogram corresponds
accurately to the total number of persons, N. The area within
a polygon corresponds only approximately. Histograms have a further
advantage, namely, that they can be employed when the variable is
not a truly quantitative one. For example, examination scripts are
often graded, not with numerical, but with letter, marks—A, A −,
B +, etc. The frequency distribution of the numbers of examinees
who are awarded each letter can be shown by a histogram.

Fig. 4.—Frequency polygons of distributions from Ex. 3: men and women.

Fig. 5.—Histogram of distribution of men’s and women’s scores
combined, from Ex. 3.

Fig. 6.—Distribution of test scores of a small group.

CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS

Smoothness of Distributions.—Here we must distinguish
provisionally between ‘objective’ and ‘subjective’ measures, and discuss
the difference more thoroughly in the next chapter. Objective
measures are those which are practically independent of the person
who makes them, such as heights measured with a ruler, or the
scores on a standardized educational or intelligence test. Subjective
measures, by contrast, depend upon personal evaluations. They
include, for example, the marks awarded by a teacher to a series of
English compositions, assessments of pupils’ characters, and the like.

Now when graphs are drawn of frequency distributions of objective
measures, certain characteristic features are very commonly
found. First, the graphs are generally very irregular when the
frequencies are small, but the irregularities tend to become ‘ironed out’
as the numbers increase. And when N reaches a value of several
hundreds, the frequency polygon or histogram approximates to a
smooth curve. Thus the distribution in Fig. 5 is much more regular
in its rise and fall than is the distribution in Fig. 6, which represents
the scores of a group of 50 students, selected at random from the
previous 250. Fig. 7, which is a frequency polygon of the heights of
6,194 male adults, gives an almost smooth curve.

Fig. 7.—Frequency polygon of heights of 6,194 English male adults.

Fig. 8.—Irregular distribution of test scores grouped in small classes.

Fig. 9.—More regular distribution of test scores grouped in large classes.

Frequencies may be increased not only by increasing the value of
N, but also by grouping the measures into larger classes. Figs. 8-9 
show frequency polygons for the same set of 250 measures grouped 
into classes of 3 and 10 respectively. The latter is distinctly smoother 
than the former. 
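
The smoothing effect of larger classes can be seen numerically. The sketch below is an illustration, not the data of Figs. 8-9: it takes the men's frequencies from Ex. 3 (with 17 assumed for the 98 + class, as the column total of 136 requires), merges adjacent classes in pairs to give classes of 10 points, and counts how often the frequencies switch between rising and falling. The helper names `regroup` and `turning_points` are hypothetical.

```python
def regroup(freqs, k):
    """Merge every k adjacent classes (ascending order) into one larger class."""
    return [sum(freqs[i:i + k]) for i in range(0, len(freqs), k)]

def turning_points(freqs):
    """Count the places where the frequencies switch between rising and falling."""
    diffs = [b - a for a, b in zip(freqs, freqs[1:]) if b != a]
    return sum(1 for d1, d2 in zip(diffs, diffs[1:]) if (d1 > 0) != (d2 > 0))

# Men's frequencies from Ex. 3, classes 73+, 78+, ..., 133+ (5 points each).
men = [3, 8, 6, 14, 23, 17, 21, 17, 13, 8, 5, 0, 1]
wider = regroup(men, 2)                              # classes of 10 points
print(wider)                                         # [11, 20, 40, 38, 21, 5, 1]
print(turning_points(men), turning_points(wider))    # 6 1
```

The regrouped distribution rises to a single peak and falls away again, at the cost of a coarser picture of where the scores lie.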

Types of Distribution Curves.—In actual practice we can never 
obtain measures of a large enough number of persons for the distri- 
bution graphs to become completely smooth curves. Nevertheless, 
it is customary to represent various types of distributions pictorially 
by means of curves of appropriate shapes, to which obtained
distributions approximate more or less closely. For example, we can
say that Fig. 10 represents roughly the type of distribution which
might be expected for the incomes of parents of elementary school
children. The greatest number would probably lie between £300 and
£800 per annum, and the frequencies of those at higher income levels
would progressively diminish. This particular type of distribution is
called a skew curve, because of its asymmetry. It is positively skewed
towards the upper end, and shows a piling up of cases at the lower
end. A bunching at the upper end would give a negatively skewed
curve.

Fig. 10.—Skew curve, illustrating approximate distribution of annual
income.

Fig. 11 is a bimodal curve such as might be obtained if we graphed
the intelligence quotients¹ of pupils in two schools, one containing
very intelligent children of professional class parents, the other con-
taining inferior children of very poor stock. Most of the I.Q.s fall
either between 110 and 140 or between 70 and 90, but there are a
few of even higher or even lower intelligence, and a few (probably
drawn from both schools) of intermediate or average intelligence,
with I.Q.s of 90 to 110.


¹ It is assumed that all readers know roughly what is meant by
intelligence quotients (I.Q.s). A definition may be found on p. 19,
and a more precise account of their meaning on p. 194.

Fig. 11.—Bimodal curve, illustrating intelligence quotients of mixed
groups.

The two polygons which appear in Fig. 4 are irregular on account of
the small frequencies involved. They approximate, however, to a pair
of curves such as those shown in Fig. 12. Note here that the more
highly peaked curve (representing women’s scores) does not indicate
that women obtain, on the whole, higher scores than men, but that
women obtain more scores close to the average of 100 points, and
fewer extreme scores; in other words, that men are more dispersed or
spread out in their ability.

It should be remembered that the areas enclosed beneath distribution
curves are proportional to the total numbers of cases involved. In
Fig. 12 the areas are roughly equal, since the total numbers of the
two sexes are about the same. Fig. 13 shows the curves to which the
distributions of scores among Honours and among ordinary degree
students approximate. The latter have, on the whole, somewhat lower
scores than, but are four times as numerous as, the former. It is more
usual, when groups of such different sizes are being compared, to plot
percentage frequencies instead of actual frequencies, since the curves
will then possess the same area, and differences in the averages and
ranges or dispersions of measures will stand out better.

Fig. 12.—Two distributions of intelligence test scores, with different
dispersions.

Fig. 13.—Distributions of intelligence test scores of two groups, one
much larger than the other but of lower average intelligence.
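
The conversion to percentage frequencies is simple arithmetic, and may be sketched as follows. The two sets of frequencies are hypothetical figures chosen only so that one group is four times the size of the other; they are not the data behind Fig. 13.

```python
def percentage_frequencies(frequencies):
    """Express each class frequency as a percentage of the group total."""
    total = sum(frequencies.values())
    return {score: 100.0 * f / total for score, f in frequencies.items()}

# Hypothetical frequencies: 50 Honours students and 200 ordinary degree students.
honours = {90: 5, 100: 15, 110: 20, 120: 10}
ordinary = {80: 20, 90: 60, 100: 80, 110: 40}

honours_pct = percentage_frequencies(honours)
ordinary_pct = percentage_frequencies(ordinary)

print(honours_pct)
print(ordinary_pct)
```

Both sets of percentages add up to 100, so the two curves enclose the same area and only the difference in average and spread remains visible.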


THE NORMAL FREQUENCY DISTRIBUTION

Whenever a large group of human beings is taken at random (i.e. not
specially selected like the two schools whose I.Q.s are graphed in
Fig. 11), then objective measures of almost any trait or ability they
possess, physical or mental, will be found to conform to a certain
type of distribution known as the normal curve. This is often
referred to as the normal probability curve, the normal curve of 
error, or the Gaussian curve. It is symmetrical, bell- or cocked-hat- 
shaped. It indicates that the majority of persons in the group obtain 
measures round about the average, relatively few obtain extreme 
measures, and that the frequencies above the average are the same 
as the frequencies below it. The four curves in Figs. 12-13 belong to 
this type, and all the graphs from Fig. 4 to Fig. 9 roughly approxi- 
mate to it, since they represent distributions of scores obtained by 
unselected groups of persons. Innumerable instances of distributions 
which tend to this normal shape could be cited: the heights of chil- 
dren in an ordinary school class, the numbers of times they can tap 
with one finger in a minute, or the numbers of words they can write. 
Variations in individual performances as well as differences between 
individuals show the same trend: if a person tries a hundred times 
to copy a line of a certain length, most of his attempts will be close 
to his own average, and a few will be considerably too long or too 
short. Differences between groups are also similar: if all the
elementary school classes aged 11+ in a large city are given an
objective arithmetic test, and the average mark of each is calculated, then
a graph of all these averages may be expected to show a normal 
distribution. 
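
The last point, that averages of classes tend to a normal distribution, can be illustrated by a small simulation. This is a sketch under stated assumptions: the marks are drawn at random from a flat (rectangular) distribution, the class size of 30 and the 500 classes are arbitrary, and the check used is merely that roughly two-thirds of the averages fall within one standard deviation of their centre, as the normal curve would require.

```python
import random
import statistics

random.seed(0)  # fixed seed so that the sketch is repeatable

def class_average(n_pupils=30, max_mark=50):
    """Average mark of one simulated class; individual marks are flat, not normal."""
    marks = [random.randint(0, max_mark) for _ in range(n_pupils)]
    return statistics.mean(marks)

# Hypothetical city with 500 classes.
averages = [class_average() for _ in range(500)]
centre = statistics.mean(averages)
spread = statistics.pstdev(averages)

# Proportion of class averages lying within one standard deviation of the centre.
within_one_sd = sum(abs(a - centre) <= spread for a in averages) / len(averages)
print(f"proportion within one standard deviation: {within_one_sd:.2f}")
```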

Conditions which Upset the Tendency to Normality.—There are 
several factors which may disturb this normal tendency. First, the 
total frequencies involved may be so small that the irregularities, 
noted above, may mask it. It is usually apparent, however, even with 
groups of 50 or less (cf. Fig. 6). Secondly, the group of persons may 
be selected in such a way as to distort an otherwise normally distri- 
buted variable. For example, a school class which has been ‘creamed’ 
will be likely to show a negatively skewed distribution of test scores 
or examination marks, since there will be a deficiency of pupils of 
high ability. The police force or the army will not show a normal 
distribution of heights, if no recruits are accepted below a certain 
height. 

Thirdly, there are a few human characteristics which, although
objectively measurable, are known to be abnormally distributed.
Accident proneness is one such variable.* It is found that a few people
are liable to undergo a large number of accidents in any given period
of time, whether at home, or in factories, or when driving cars. But
the majority have few or no accidents. Thus the distribution takes
roughly the form shown in Fig. 14. Annual income also gives a skew
distribution, and it is likely that other measures of ‘productivity’
such as literary or scientific output are similar (cf. Burt, 1943). It is
hardly possible to prove that intelligence quotients conform to the
normal curve, though this is a convenient assumption that accords
reasonably with commonsense observation. It does break down at
the bottom end of the scale, since there are more imbeciles and idiots
of very low I.Q. in the population than would be expected. Some
investigators have claimed other marked divergences from the normal
shape, but these might be attributable to inequalities in the units
of measurement.

Fig. 14.—Approximate frequency distribution of accident proneness.

Such irregularities, or defects in our methods of measuring variables,
constitute a fourth condition, which is of very frequent
occurrence, as we shall see in the next chapter.

* Cf. H. M. Vernon (1936).


CHAPTER II 


PRINCIPLES OF MENTAL MEASUREMENT AND MARKING


WE are all accustomed nowadays to regarding a scholastic capacity 
(e.g. arithmetical ability), like height or weight, as ranging from a 
very small amount in some people to a very large amount in others, 
and to assessing these amounts in numerical units such as percentage 
marks. General intelligence and other aptitudes can similarly be 
graded as quantitative variables, average intelligence being generally 
denoted by an Intelligence Quotient or I.Q. of 100, superior or
inferior intelligence by larger or smaller numbers. Yet the conception 
of scaling mental qualities in a manner similar to physical ones is of 
fairly recent origin—we owe it largely to the work of Sir Francis 
Galton—and it is still somewhat limited in application and beset 
with many difficulties. A human ability or trait is far more complex 
than a length, a weight, or a time; it can never be completely repre- 
sented by a single numeral on a scale. Consider, as an example, 
political opinions—a communist is certainly more socialistically in- 
clined than a liberal, and a fascist is more reactionary than a con- 
servative. Successful tests have, therefore, been devised which scale 
these attitudes as a single variable ranging from extreme left-wing to 
extreme right-wing. But such a scale can hardly do justice to all the 
various shades of political opinion; it inevitably involves some over- 
simplification and artificiality. Particularly in the field of tempera- 
ment and character does such distortion arise, for here people appear 
to us to vary from one another, not merely in amount, but also in
kind or quality.¹ The same is true, though to a lesser extent, in the
field of abilities. Thus the precise nature of intelligence is an extremely
controversial matter, and the common tendency to regard the I.Q.
as an accurate measure of some definite and stable mental entity is,
as we shall see below, an egregious misinterpretation.

¹ For a fuller discussion of these points, cf. Vernon (1953).

No mental trait or ability can be measured directly by laying some
kind of ruler alongside it. Instead we have to grade it by observing
its expression in words or actions. Arithmetical ability, for example,
is shown by success at a series of sums; moral character by certain
modes of behaviour. Indeed, for scientific purposes, we have to regard 
a trait or ability, not as some power or faculty in the mind, but as a 
particular class or category of related performances, e.g. of the 
arithmetical or the moral kind. And special statistical techniques are 
required for deciding just what performances should be included 
under one heading, as representative of one ability or trait (cf. Chap. 
VIII). Since we cannot, of course, observe and record all the per- 
formances which go to make up a mental characteristic, we are 
forced to take samples of them. Our procedure, as Binet pointed out, 
is analogous to that of the mining engineer who cannot examine all 
the ground in a given locality, but instead takes borings at various 
points, and from these samples deduces with reasonable accuracy 
what the district as a whole is like. Similarly, the psychologist 
measures intelligence, or the teacher measures achievement in arith- 
metic, on the basis of the child’s success at sample intellectual and 
arithmetical tasks. An arithmetic examination sets, perhaps, six 
questions, the answers to which constitute a sampling of the general 
field of arithmetical knowledge which the pupil is supposed to have 
acquired. 


How, then, is success at such sample performances expressed in 
numerical terms? The various methods of marking or scoring mental 
tests and examinations may be classified as follows: 

A. Enumeration or count scoring.
B. Ranking.
C. Qualitative grading and classifying.
D. Numerical evaluation.
E. Mixed numerical evaluation and count scoring.


A. ENUMERATION OR COUNT SCORING 

This type of marking is the only objective type, in the sense that 
each person's mark is quite independent of the marker's subjective 
evaluation of his work. When, for example, a series of simple arithmetic
sums is set, the answers to which are all either right or wrong,
or when one mark is deducted for each mistake in a dictation test,
then a pupil’s total mark is a count or enumeration score. Since this
method is applicable only when no doubt can exist as to the correctness
or incorrectness of the answers which are counted up, it is
largely restricted to very simple kinds of scholastic work. We shall
see later, however (Chap. XII), that attainments in more advanced

any point on the scale. E.g. the difference between 50 and 49 inches
is precisely the same as the difference between 40 and 39 inches.
Moreover, they always have a zero point: e.g. 0 lb. means no weight
at all. Mental ability units often lack these features, and this fact 
precludes us from adding, averaging, or otherwise treating them as 
if they were true numbers. 

For example, a one-year-old child might fail to pass any of the 
Terman-Merrill Intelligence Tests, but his zero score shows, not that 
he possesses zero intelligence, but that the tests are an inadequate 
measuring instrument for this purpose. Again, a child with an
I.Q. of 120 cannot be considered twice as intelligent as one with
an I.Q. of 60, because we do not know what is the zero point of
the I.Q. scale. Even in a rate test of, say, reading speed, a child who
reads 50 simple words in a minute, and so scores twice as much as 


me, cannot properly be said 


as words read, can 
never be precisely uniform in difficulty. Similarly, the various errors 


weights, some of which are slightly heavier, 
than a pound. When using the machine to wei 


mental test is large, and irregularities 


are small the mental units may be regarded as approximately
equivalent.

same amount. For example, each test in the Terman-Merrill Scale
(from Year VI to Average Adult level) is supposed to represent an 
increment of two months of mental age. But it is generally admitted 
nowadays that mental age units are not equivalent, and that the 
difference between a child of M.A. 13½ and one of 13 is probably
much smaller than the difference between children of 6½ and 6. Here,
then, it is as if we ran out of (nominal) pound weights after a certain 
point, and had to use further weights which progressively increased 
in heaviness. Obviously this leads to grave difficulties in the mental 
age method of scoring (cf. Chap. IV). Several group intelligence 
tests, however, do manage to maintain approximately equal incre- 
ments throughout. 

The units of measurement of many physical skills present similar 
problems. That equal increments of score do not necessarily represent 
equal increments of skill can be seen from the example of golf scores. 
It is decidedly easier for a player who requires 140 strokes for a round 
to improve to 130, than it is for a player who takes 70 to improve 
to 60 (or even—if we could assume golf scores to follow a geomet- 
rical rather than an arithmetical progression—to 65). Here, also, it is 
impossible to establish a zero point of ability, or to calculate ratios 
between abilities. 

Special techniques have been devised by Thorndike, Thurstone, 
and others for the production of measuring scales with equal units, 
and for determining the zero points on such scales. They are too 
complex for description here. But they depend in the main upon the 
principle of normal distribution of objective measures, which was 
enunciated in Chap. I. Well-constructed tests always tend to yield 
a normal frequency distribution; hence we are entitled to regard 
departures from normality as indicative of bad construction or 
bad scoring. For example, a mental test which contains plenty 
of easy and moderately difficult items, but whose later items in- 
crease too rapidly in difficulty to allow sufficient headroom, will 
give a negatively skewed distribution. The analogy with a weighing 
machine would be as follows: suppose that the available weights 
beyond the seventieth became progressively bigger than 1 lb., and
that a large group of children were measured whose true weights
averaged about 70 lb., and ranged from 50 to 90 lb. The distribution
would then be distorted from normal to negative skew type, as in
Fig. 15. Conversely, a test which contains insufficient easy items, so
that it only differentiates effectively among the better pupils, is
likely to show a positively skewed distribution.

1 For a definition and discussion of mental age, cf. pp. 69-73.

24 THE MEASUREMENT OF ABILITIES
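
The distortion produced by lack of headroom can be sketched numerically. The example below takes the extreme case, a simple ceiling beyond which no further marks can be earned, and measures skewness by the third moment about the mean divided by the cube of the standard deviation. The scores and the ceiling of 75 are hypothetical, and this coefficient is only one of several in use; it is an assumption of the sketch, not a method taken from the text.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: third central moment over S.D. cubed."""
    mean = statistics.fmean(values)
    second = statistics.fmean((v - mean) ** 2 for v in values)
    third = statistics.fmean((v - mean) ** 3 for v in values)
    return third / second ** 1.5

# Hypothetical 'true' scores, symmetrical about 70.
true_scores = [50, 60, 60, 70, 70, 70, 80, 80, 90]

# A test with too little headroom: nobody can score above the ceiling.
ceiling = 75
observed = [min(score, ceiling) for score in true_scores]

print("skewness of true scores:    ", round(skewness(true_scores), 3))
print("skewness of observed scores:", round(skewness(observed), 3))
```

The symmetrical set has zero skewness, while the set cut off at the ceiling has a negative coefficient, the abler pupils being bunched together at the top.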

Application to School Marks and Examinations.—It is difficult to apply
this principle to school marks, since the groups are usually so small
(less than 50) that great irregularities in distribution are sure to
arise. The following table, however, gives the marks obtained by two
11+ classes in an ordinary school on an objective arithmetic test.
Owing to the deficiency in difficult questions, many pupils at the top
got full marks (80), and the real extent of their ability was not
tested at all.

Fig. 15.—Normal distribution (a), altered to negative skew curve (b),
by increasing

Ex. 4.—An abnormal frequency distribution of arithmetic marks.

Further problems occur in the interpretation of count scores.
Physical measures, such as 60 inches or 15 seconds, always represent 
definite amounts. No doubt exists as to their largeness or smallness, 
because an inch or a second has an absolute, standard size. Laymen 
and teachers often wrongly suppose that the same is true of mental 
measures, such as the mark awarded to a pupil’s work. They talk of 
15/20 or 60% as if these numbers implied the same level of achievement
whenever, or by whomsoever, they are awarded. Yet it is surely
obvious that such scores, by themselves, tell us nothing at all about the
goodness or poorness of achievement. Such statements as: “Johnny
is doing well at arithmetic. He got 8 sums out of 10 correct,”
or “His dictation is terrible; he made 15 mistakes,” are quite
meaningless unless we also know the difficulty of the sums and of the 
passage dictated. Johnny may be the poorest arithmetician in his 
class, and yet score 8/10 if sufficiently simple sums are set. Or he 
may be the best speller in his class, and yet make 15 mistakes if the 
passage happens to be taken, say, from a textbook of organic 
chemistry. The speaker is, of course, implying that the tasks are of a 
level of difficulty such that most of the class score less than 8 and 
— 15. In other words, the marks possess only a relative, not an
absolute, significance. Again, when an inspector or head teacher
complains that the average examination mark of a class is ‘too low,’ he is
presumably comparing the mark and the difficulty of the examina- 
tion with subjective recollections of the achievements of other 
similar classes. Undoubtedly there is a great deal of loose thinking 
about these matters, and much can be learnt from the more sys- 
tematic procedure of the scientific mental tester. 

The tester realizes that the goodness or poorness of a child’s test 
performance can be interpreted only by comparing it with the per- 
formances of other similar children. Hence he applies his test to 
large numbers of children, similar as to age and other relevant 
circumstances, scores their performances all in the same way, and 
can then state definitely whether a certain performance is superior or 
inferior, and how much better or poorer it is than the average. This 
procedure is called the establishment of norms or standards for the 
test. A discussion of the main types of test norms is given in Chap. 
IV, where it is also shown how the adoption of analogous methods 
may greatly help in the interpretation of school and examination
marks.
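
The tester's procedure may be sketched as follows, using the case of a pupil who gets 8 sums out of 10. The two sets of norm-group scores are invented for illustration, and the function name `standing` is hypothetical; real norms would rest on far larger numbers of children.

```python
import statistics

def standing(raw_score, norm_scores):
    """Return (percentage of the norm group scoring below raw_score,
    difference between raw_score and the norm-group average)."""
    below = sum(s < raw_score for s in norm_scores)
    percent_below = 100.0 * below / len(norm_scores)
    return percent_below, raw_score - statistics.mean(norm_scores)

# Hypothetical norm groups for an easy and for a difficult set of ten sums.
easy_test_norms = [8, 9, 9, 9, 10, 10, 10, 10, 10, 10]
hard_test_norms = [2, 3, 3, 4, 4, 5, 5, 6, 7, 8]

on_easy = standing(8, easy_test_norms)
on_hard = standing(8, hard_test_norms)

print("8/10 on the easy test:", on_easy)
print("8/10 on the hard test:", on_hard)
```

The same raw score of 8 is the lowest in the group on the easy sums and surpasses nine-tenths of the group on the difficult ones, which is why the raw mark alone conveys nothing.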


B. RANKING 

Ranking means listing a group of pupils or students in order of 
their attainment, 1st, 2nd, etc., down to bottom. This method may
be applied to any form of scholastic work, including composition,
handwriting, and the like, which are scarcely amenable to count
scoring. It is actually the soundest of all forms of marking, in spite of
its many limitations, because these limitations are so obvious and so
easily recognized. The first drawback is that the number of persons
who can be ranked is rather small. Even the arrangement of twenty
in order involves a good deal of uncertainty, especially near the
middle of the list. Those at the extremes are usually more readily
discriminated from one another. Next, it is clear that these ranks do
not provide a true numerical scale, since the numbers 1, 2, 3, etc., are
by no means equally spaced. Usually the differences between the
attainments of Nos. 1 and 2, 2 and 3, 18 and 19, 19 and 20 (out of 20)
are greater than the differences between Nos. 9 and 10, 10 and 11;
hence the difficulty, just mentioned, in ranking the medium pupils.

Fig. 16.—Rectangular frequency distribution given by a rank order.


The frequency distribution is rectangular (Fig. 16), and is entirely 
unlike the normal curve. These numbers, therefore, may give quite 
false impressions if they are added, averaged, or otherwise manipu- 
lated. For example, a pupil who is first in one examination and third 
in another should be regarded as better than, not as equal to, a 
pupil who is second in both (cf. p. 59).
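
Why the first-and-third pupil should be placed above the pupil who is twice second can be shown by converting ranks into positions on a normal scale before adding them. The sketch below assumes a group of 20 and uses one common convention, which treats each rank as the mid-point of its share of the distribution; it is offered as an illustration and is not necessarily the procedure given elsewhere in this book.

```python
from statistics import NormalDist

def rank_to_deviate(rank, group_size):
    """Convert a rank (1 = best) to a normal deviate, taking the rank as the
    mid-point of its 1/group_size slice of a normal distribution."""
    proportion_below = (group_size - rank + 0.5) / group_size
    return NormalDist().inv_cdf(proportion_below)

GROUP_SIZE = 20  # assumed size of the ranked group

first_and_third = rank_to_deviate(1, GROUP_SIZE) + rank_to_deviate(3, GROUP_SIZE)
second_twice = 2 * rank_to_deviate(2, GROUP_SIZE)

print("sum of raw ranks:", 1 + 3, "against", 2 + 2)
print("sum of normal deviates:", round(first_and_third, 2),
      "against", round(second_twice, 2))
```

The raw ranks add to the same total, but on the normal scale the gap between 1st and 2nd is wider than that between 2nd and 3rd, so the first pupil comes out ahead.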

A third main drawback is that these numbers are always relative 
to the particular set of persons who have been ranked. Thirtieth 
place is not an unequivocal number like 30° on a thermometer. For
30th out of 32 is much poorer than 30th out of 50. Moreover, if any
one person higher on the list should happen to be absent, the 30th
rank would change to 29th. It is not even analogous to, say, 30 mistakes
in a dictation test, since such a test can be applied to large
numbers of children, and norms can be collected which will tell us
how low is the level of ability to which 30 mistakes corresponds. In
contrast, a rank position can never be so interpreted with reference to
outside standards.

C. QUALITATIVE GRADING AND CLASSIFYING 
Examination Scripts, essa 


some three to ten gr. 
В +, etc.; o, В, 
Class, etc.; 


Scores (cf. Garrett's or Guilford's books). An
Version, to correlate one
ranking of school work
(cf. Chap. VI).

primitive type of grading than numerical evaluation (e.g. on a 0 to 10,
or a percentage, scale), though it is of course often applied to scripts 
which have already been awarded percentage marks. For instance, 
examinees whose mark is 80% or more are informed that they have 
obtained First Class or Credit, and so on. 

Since these grades make no pretence of being numerical measures, 
we need not cavil at the eccentricity of their frequency distributions. 
They are, however, liable to another, still more serious, defect which 
is not found among count scores or ranks, namely that different 
teachers or examiners often employ widely different scales or stan- 
dards of grading. Although they are supposed to represent absolute 
evaluations of achievement, they are actually found by experimental 
investigation to involve such gross discrepancies that neither pupils, 
parents, nor teachers can hope to interpret their significance correctly. 
For example, Hartog and Rhodes (1935) arranged for 48 School Cer- 
tificate French scripts to be graded independently by six experienced 
examiners. One of the examiners awarded 46 credits, 2 passes, and 0 
failures. Another gave 17 credits, 12 passes, and 19 failures. The
remaining four were intermediate. Similarly, Boyd (1924) had 26
essays by 11+ children graded by 271 teachers as either Excellent,
VG +, VG, G +, G, Moderate, or Unsatisfactory. Some of the 
teachers awarded one or other of the two highest grades to as many 
as 15 of the essays, others to none of them. Some also awarded one or 
other of the two lowest grades to 17 of the essays, others to none of 
them. In view of such results, it is clear that there is no agreed 
meaning as to the absolute amount of ability which these gradings 
should signify. A mark of Excellent may represent a high level of 
ability if it is awarded by a teacher who very seldom gives Excellents. 
It may mean little better than average ability if it is awarded by a 
teacher who habitually grades nearly half his papers as Excellent. 
In other words, this type of marking is actually a relative one, like 
ranking. A mark of G +, or B-, or Pass, etc., only achieves any
meaning if we know what proportions of candidates obtained better
or poorer marks, in just the same way as a rank position of 30th can
only be interpreted if we know how many candidates were ranked 
lower than 30th. 
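
The relative meaning of a grade can be made explicit by working out what proportion of scripts each examiner placed at or above it. The sketch uses the two extreme examiners from the Hartog and Rhodes figures quoted above, on the assumption that the second examiner's failures numbered 19, which brings his total to the 48 scripts graded; the function name is hypothetical.

```python
def percent_at_or_above(grade, awarded, order):
    """Percentage of scripts given this grade or a better one.

    `order` lists the grades from best to worst.
    """
    total = sum(awarded.values())
    cutoff = order.index(grade)
    at_or_above = sum(awarded[g] for g in order[:cutoff + 1])
    return 100.0 * at_or_above / total

order = ["Credit", "Pass", "Fail"]
lenient_examiner = {"Credit": 46, "Pass": 2, "Fail": 0}
severe_examiner = {"Credit": 17, "Pass": 12, "Fail": 19}  # 19 is an assumed reading

lenient_credit = percent_at_or_above("Credit", lenient_examiner, order)
severe_credit = percent_at_or_above("Credit", severe_examiner, order)

print(f"Credit from the lenient examiner: top {lenient_credit:.0f}% of scripts")
print(f"Credit from the severe examiner:  top {severe_credit:.0f}% of scripts")
```

A Credit from the first examiner says almost nothing, since nearly every script received one; from the second it places the script in roughly the top third.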

Qualitative grading should then be regarded as a form of ranking 
where, instead of there being as many different ranks as there are pupils 
or examinees, there are only some 3 to 10 ranks, any one of which 
may be awarded to several pupils whose work is of approximately
equal merit. Its grave defect is that any one script is not usually
compared directly with the scripts of a particular group of pupils
before receiving its grade; it is merely compared with the marker’s more
or less vague recollections of the scripts which he has marked in the 
past. The only possible rational meaning for Excellent or First Class 
is that the script which receives this grade seems to him to rank 
higher, or to be relatively better, than almost all the scripts produced 
by similar pupils which he has come across. The other grades are 
similarly referred to rough subjective standards which he has set up 
during his previous marking experience. And since his experience 
naturally differs from all other markers’ experiences, so will his 
standards differ, unless he has taken the trouble to discuss and com- 
pare them with others, and to try to adjust them accordingly. 
Finally, as in the case of rank orders, there is no possibility of 
exact comparison of such grades with external standards or norms, 
and they cannot be added or averaged in any satisfactory way. 


D. NUMERICAL EVALUATION
and

E. MIXED NUMERICAL EVALUATION AND COUNT SCORING

Numerical evaluation consists, nominally, in assigning absolute 
numerical grades to pupils’ abilities, school work, or examination
scripts, either on a 0-10, 0-20, percentage, or some other scale. It
exhibits all the ambiguities of the qualitative grading just described, 
and possesses several additional defects of its own. In other words, 
although it is the most commonly used, it is quite the worst type of 
marking. Unfortunately, since it involves numbers, it is frequently 
confused with the count scores obtained from objective marking 
(Type A). Teachers assume that the label of, say, 15/20 which they 
award to an English composition is equivalent to a score of 15 cor- 
rect arithmetic sums out of 20. Often the two become inextricably 
mixed. For instance, partial credits involving subjective judgments 
of merit may be awarded to partially right sums. Or an examination 
in history, Science, langu 
each of which is marked subjectively out of 20 


subjective impressions of style or general merit. 


1 When the group of pupils is a large one, their grades can be converted into
rational scores; cf. Dawson (1933), pp. 91-3.

Distributions of such marks are sometimes smooth and symmetrical,
but are often wildly irregular or skewed. In an examination where a 
pass mark of, say, 50% has been fixed, it is common to find very 
large numbers of 50's, 51's, and 52's, but no 45's to 49's. Some
teachers show peculiar idiosyncrasies, such as avoiding certain marks
altogether. For instance, they may award 9, 8, 6, 5, or 3 out of 10
very often, but scarcely ever give 10, 7, 4, 2, or 1. Two illustrations
may be given of distributions among students or pupils who also 
took an examination which was objectively scored. 

Ex. 5.—Distributions of (Subjective) Marks for Reading and
(Objective) Marks for Geography.

Mark     Reading F     Geography F

 10         24              2
  9         25              3
  8         13              8
  7          6             10
  6          4             11
  5          2              9
  4          0              9
  3          0              7
  2          0             10
  1          0              4
  0          0              1

Total       74             74


The reading marks in Ex. 5 are fairly typical of the distribution 
adopted by teachers in evaluating this subject, or composition, or 
handwriting. Note that this teacher was not, as she supposed, marking
out of 10, but out of 6, since she only used 6 different grades.
The fact that about one-third of the class get 10 implies that the 
work of all these children was equally good, which is certainly not 
true. The geography marks awarded by the same teacher were 
derived from a new-type test paper, and so give a roughly normal 
distribution. It is surely obvious that 5/10 for geography does not 
have the same significance, nor represent the same achievement, as 
5/10 for reading. In the one it means approximately the average for 
the class, in the other it means right at the bottom of the class. But 
the children themselves and their parents are likely to assume that 
5/10 always means the same thing.
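
The different standing of 5/10 in the two subjects can be checked from the frequencies of Ex. 5. In this sketch the garbled sixth mark in the table is read as 5, as its position implies, and the standing is expressed as a percentile rank, counting half of the pupils who obtained exactly that mark as falling below it. That convention is an assumption of the sketch.

```python
marks = list(range(10, -1, -1))  # 10, 9, ..., 0, as in Ex. 5
reading_f = [24, 25, 13, 6, 4, 2, 0, 0, 0, 0, 0]
geography_f = [2, 3, 8, 10, 11, 9, 9, 7, 10, 4, 1]

def percentile_rank(mark, marks, freqs):
    """Percentage of pupils below the mark, counting half of those exactly on it."""
    total = sum(freqs)
    below = sum(f for m, f in zip(marks, freqs) if m < mark)
    same = sum(f for m, f in zip(marks, freqs) if m == mark)
    return 100.0 * (below + same / 2) / total

reading_rank = percentile_rank(5, marks, reading_f)
geography_rank = percentile_rank(5, marks, geography_f)
grades_used_in_reading = sum(f > 0 for f in reading_f)

print(f"5/10 in reading:   percentile rank {reading_rank:.1f}")
print(f"5/10 in geography: percentile rank {geography_rank:.1f}")
print("different marks actually used for reading:", grades_used_in_reading)
```

On these figures 5/10 in reading stands below nearly the whole class, while 5/10 in geography lies close to the middle.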

Fig. 17 compares the distributions of percentage marks obtained 
by a large group of students on a new-type examination and on 
an ordinary examination of the essay-type. The former, though
somewhat irregular, tends towards the normal, symmetrical shape,
and includes marks ranging from 19% to 86%. The latter is skewed,
and it ignores the bottom half of the available scale of marks, since
it ranges only from 50% to 90%.

Fig. 17.—Distributions of percentage marks from (a) an objective examination,
and (b) a subjectively marked examination.

Education authorities would surely distrust a school doctor the 
weights on whose weighing machine varied, say, from ¾ lb. to 1½ lb.
each. Yet the units implied by the distributions which many of their
teachers employ are probably quite as variable as this, unless some
scheme of rationalization is applied (cf. Chap. IV). It is notorious
also that some faculties in a university use quite different scales of
marks, or are much more lenient than other faculties.

It must be obvious that such marks are not true measurements.
60% does not represent 60/100 of anything whatsoever. True, it lies a
certain distance between 0 and 100, but as nobody can define what
is meant either by zero or perfect ability, that fact does not help
much. Like the qualitative grades of Type C, they are merely labels
whose precise size is fixed more or less vaguely by convention.
Naturally the convention varies in different institutions.

Numerical grading, like qualitative grading, actually consists of
ranking the scripts in relation to the marker's subjective past
experience. It differs merely in that number labels are substituted for
verbal ones, and that most numerical scales contain a larger number
of different labels than do qualitative scales. This latter feature leads
to still another common defect. When an essay or a single examination
answer has to be marked on a percentage basis, the implication
is that the marker can distinguish, and keep in mind, a hundred
different grades of ability (or 51 different grades if he confines himself
to a range of 40% to 90%). Actually it has been proved by experiment
that people cannot consistently discriminate more than about
10 grades, and some would say only 5. 
THE REFORMATION OF MARKING

The would-be reformer of marks meets with many objections and
criticisms. Most of them either rest upon ignorance of elementary
statistical principles, or else ‘boil down’ to the view that the systems
in vogue at present are firmly established, generally understood by
all concerned, and work too well and smoothly to be upset. While
such a reactionary attitude is entirely unwarranted, yet there are
admittedly many occasions in school marking where the rationality
of the scales employed matters little, or where an eccentric distribu-


nation attempts to pick 
grammar school educatio 
has roughly the opposite 
Pupils who are u 
(1938) points out that the most 
in these two instance. 


quotients to assist in allocating pupil 
School they pore 


д appropriate i 
ultimately attend, р Streams in 




Next we can distinguish examinations which are constantly set in 
schools for such pedagogic purposes as ensuring that all, or practically
all, the pupils have successfully accomplished a certain stage
in their work, or that they have learnt, say, some French grammar,
chemical formulae, or history notes, set to them for preparation.
Other exercises are needed, notably in mathematical subjects, for 
practice in the operations which the pupils are in process of acquiring. 
In these instances we may expect distributions to show a strong 
negative skew. The tests will naturally be designed so that the major- 
ity can get full or nearly full marks. The weaklings will be somewhat 
more spread out, but even the poorest may be able to attain half 
marks. Doubtless it is because these abnormal distributions have 
become habitual in classwork that most teachers are so blind to the 
defects in their examination marking. Such tests which are used as 


pedagogic aids should be kept quite distinct from tests or examina- 
tions designed for 


marked according to t 
Standardized (publish 


given to a class or to a group of students
pupils, and to yield an approximately normal distribution. Ideally
the average pupil in the class, or the average examinee, should get as
near as possible to half marks, the poorest pupil little better than
zero, and the best pupil little less than full marks. For if no examinee
scores less than, say, 40%, then it is obvious that the easy questions
at the beginning are a waste of everybody’s time. And if no one scores 
more than, say, 70%, the most difficult questions at the end are also 
wasted. 

Humanitarian Objections.—An immediate, and quite pardonable, 
objection will be raised against this ideal, namely, that if marks range 
from nearly 0% to nearly 100%, the poorer pupils will become 
greatly discouraged when they habitually obtain marks of around 
10% to 20%, and their parents will become bitterly resentful. Some 
teachers are so conscious of the stimulating effects of success and 
the depressing effects of failure, that they deliberately distort marks,
giving the duller pupils who have tried hard more than their work
deserves, and the brighter ones who have not done so well as they
might less than they deserve. It may be that this kindheartedness
of teachers towards the dullards has been partly responsible for
forcing up their distributions of marks into the absurd shapes
typified by Ex. 5 above. Perhaps equally potent has been their fear
of the criticisms of head teachers and inspectors if they admitted
their failure to teach the almost ineducable pupils by marking their
work on its true merits. Worthy as such humanitarian objections
may be, they do not constitute an excuse for bad test construction or
bad marking. But the solution is not difficult. We shall see in Chap.
IV that the actual average, and range, or the scale of marks, em-
ployed do not matter in the slightest, provided that everyone agrees
to use the same scale. Either, then, the test may contain sufficient
(useless) questions which are easy enough to encourage the poor
pupils; or the test may be rationally constructed, as indicated above,
so as to yield a range of, say, 4 to 26 out of 30, or 10 to 70 out of
80; and when the marking is complete the marks can readily be
converted into some more acceptable and conventional scale, such
as 40% to 90%, by the methods described in Chap. IV. A third
alternative is to withhold all numerical grades from pupils and
parents, since they are so liable to misinterpretation, and instead,
once the scientific marking is done and is recorded for administra-
tive purposes, to reconsider the scripts and attach any verbal or letter
labels to them that appear suitable for pupil and parental consumption.

Grading of Complex Educational Products.—Certain types of work
such as English composition and handwriting usually have to be 
graded on the basis of total subjective impression. The same is true 
of many examination scripts when the answers consist of a series of 
historical, scientific, or other essays. Here all the answers to any one 
question should invariably be marked before the examiner proceeds 
to another question. Whenever there are about twenty or more such 
essays (or other educational products) to be dealt with, resort should 
be made to the quality or product scale method (cf. p. 154). The
marker should first skim through a fair number of the scripts until
he has gained a rough impression of their average level and of their
extreme range of attainment. No account whatever should be taken
of his feelings that this level is laudably good or deplorably bad.
What is needed is the actual average of these particular scripts.
Next he should look through them more carefully and pick out one
which typifies the average, set it aside, and call it 5. Next the two
should be taken which appear to be the best and worst of the bunch;
these should be called either 9 and 1, or 8 and 2. Finally, others
should be selected which seem to represent levels intermediate
between 9 (or 8) and 5, and between 5 and 1 (or 2). These should be
numbered (8), 7, 6, 4, 3, (2). A good deal of revision of first im-
pressions may be required before a final decision is reached as to
these nine (or seven) samples. Once they are chosen the marking is
straightforward. The rest of the scripts are simply compared one by
one with the samples and each is awarded the mark of the sample
which it most closely resembles in merit. Should one or two turn up
which are superior to No. 9 or inferior to No. 1, they may be given
10 or 0.

One obvious advantage of this plan is that it enables the marker 
to be self-consistent and to maintain the same standards throughout. 
Another is that it does not demand impossibly fine discriminations. 
The number of samples can be reduced to five, or extended to eleven, 
if desired, or some scripts may be given a mark midway between two
samples, e.g. 6½. But usually seven or nine grades are sufficient.
Most important, however, is the fact that the final distribution will
be fairly smooth and symmetrical if the number of scripts is large
and the samples have been well chosen. Even if their number is small
there should be a majority of 6’s, 5’s, and 4’s, and a minority of 9’s,
8’s, 2’s, and 1’s; and there should be about as many above as below
5. The precise marks given to the samples do not matter. They may,
for example, equally well be called 90%, 85%, 80% . . . 55%, 50%,
so long as the marker resolutely avoids all temptations to withhold 
90% on the irrelevant grounds that the best scripts in the bunch do 
not seem to him to be ‘up to 90% standard.’ It is, however, probably 
better to do the marking of each question on a 0-10 scale. Such 
marks for several questions in an examination can legitimately be 
added. The final totals can then be converted into any desired con- 
ventional scale, just as can the scores from an objective test. 

Other Types of Marking.—It is more difficult to give definite 
recommendations as to the marking of work of mixed type, where 
only part of the merit or demerit resides in points which can be 
counted. In foreign language examinations, history, science, etc.,
both correct or incorrect facts, and general style, or originality and
the like, may need assessment. Sometimes these quantitative and
qualitative features can be graded separately and later combined in
due proportions. It might be better if, as is suggested in Chap. XIV,
such papers were not set; either they should be definitely new-type,
or definitely of the qualitative essay-type. Nevertheless, the main 
requisites of good marking should always be attainable, namely a 
similar distribution and similar average level of marks in all subjects, 
or in all the questions set in any one subject, and an approximation 
of this distribution to normality. 

Still more difficult are the problems of the marker who has only a
few scripts with which to deal, e.g. a university external examiner
who is called in to assess half a dozen degree candidates. Even if the
papers consisted of perfectly constructed new-type tests, they would
inevitably yield extremely irregular and variable distributions. The 
examiner is, therefore, forced to employ subjective recollections of 
the merits of similar candidates whom he has marked in the past. 
Nevertheless, he should keep the normal curve in mind, for in the long 
run all the scripts that he meets are likely to be normally distributed. 
He should try, therefore, to build up definite mental specifications 
of his 9, 8, 7 . . . 2, 1 levels, and attempt, by thorough discussion 
with the internal examiners, to decide where on this scale the First 
Class, Pass, or other standards fall. 

Some additional points bearing on these plans, and the answers 
to some of the possible criticisms, will be discussed when we have 
studied, in the next chapter, the statistical conceptions of percentiles,
averages, and standard deviations.


CHAPTER III 


STATISTICAL METHODS FOR HANDLING 
FREQUENCY DISTRIBUTIONS 


Numerical Description of Distributions.—So far we have dealt with
various types of distributions of measures almost wholly in verbal
terms, and have compared one distribution with another mainly ‘by
eye.’ More exact study of them necessitates quantitative description
and definition. We need first a numerical expression for the ‘general
run’ of the measures, and, secondly, some index of the range or
spread of measures. For instance, if we want to know how intelligent
Johnny Smith is, and ask a psychologist, he will tell us Johnny's
score on an intelligence test and, in order to enable us to interpret
this score, he will add that the normal score for children of Johnny's
age is so and so, and that Johnny's score falls at such and such a level
in the distribution. Again, if we ask a teacher what are the heights 
of the children in her class, she might in reply show us a complete 
table of the distribution of heights, or she might tell us with greater 
precision that the average was 50 inches and the range from 42 to 
57 inches. 

Actually the range is not a good index of the way the measures
are spread out above and below the average, since it depends on only
two of the measures, those obtained by the extreme pupils who were
measured. For example, although this class contains children ranging
from 42 to 57 inches, all the rest might lie between 46 and 54 inches,
and we would hardly get a fair picture of the spread if we were only
told that the total range is 15 inches. A much better index is called
the Standard Deviation, often abbreviated to S.D. or σ, which will
be described below.

A distribution of measures can then generally be defined numeri- 
cally by means of its average and standard deviation. But we shall 
describe first an alternative system which possesses certain practical 
advantages. This is based on what are known as the median, quar- 
tiles, and deciles or percentiles. These terms all refer to certain levels 
or fixed points in a distribution. Suppose the measures to be tabu-
lated or arranged in order, then the median is the middle measure,
i.e. the measure half-way up counting from the bottom, or half-way
down counting from the top. The lower and upper quartiles, usually
referred to as Q₁ and Q₃, are the measures one-quarter and three-
quarters way up. The decile measures are one-tenth, two-tenths, etc.,
and the percentiles one-hundredth, two-hundredths, etc., of the way 
up. Stated in a different way, the Xth percentile is the measure below 
which X% of all the measures fall. 

Determining the Median.—Ex. 6.—Consider the following distri-
bution of only five measures. Clearly the median is 15, that is the
third measure up from the bottom or down from the top.

[Table: the five measures of Ex. 6]

From this example it may be seen that the median is the
$\frac{N+1}{2}$th measure, not the $\frac{N}{2}$th.

Ex. 7.—In the following distributions, N = 8, hence $\frac{N+1}{2} = 4\tfrac{1}{2}$.

[Tables: first and second distributions of Ex. 7]

Third distribution: 19, 17, 14, 12, 9, 5, 4, 1

Thus the median is half-way between the fourth and fifth measures
from the bottom. In the first distribution the fourth and fifth meas-
ures are both 7, hence the median is 7. In the second, the fourth and
fifth measures are 6 and 7, hence we have to take 6½ as the median.
In the third distribution the data are too sparse for an accurate
median to be determined. The best we can do is to interpolate and
call it 10½.

Determining the Quartiles and Percentiles.—The quartiles are
similarly obtained by counting the measures which are $\frac{N+1}{4}$ and
$\frac{3(N+1)}{4}$ from the bottom. For the distribution given in Ex. 1 (p. 5)
N = 50, hence the required measures are 12¾ and 38¼ up the list.
The 11th, 12th, and 13th measures are 9, the 37th to 43rd measures
are all 17, hence Q₁ = 9, Q₃ = 17.
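The counting rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `measure_at` and `percentile` are hypothetical, the P-th percentile is taken at position P(N+1)/100 from the bottom (which reduces to the text's rules for the median and quartiles), and the frequencies of Ex. 1 are copied from the table used in Ex. 9.

```python
def measure_at(ordered, position):
    """Measure at a 1-based position counted from the bottom of an ordered
    list, interpolating linearly when the position is fractional."""
    lower = int(position)              # whole part of the position
    fraction = position - lower
    below = ordered[lower - 1]
    if fraction == 0:
        return below
    above = ordered[lower]
    return below + fraction * (above - below)


def percentile(values, p):
    """P-th percentile, taken as the measure P(N+1)/100 up from the bottom."""
    ordered = sorted(values)
    position = p * (len(ordered) + 1) / 100
    # Keep the position inside the list for very small or very large P.
    position = min(max(position, 1), len(ordered))
    return measure_at(ordered, position)


# Third distribution of Ex. 7.
THIRD_DISTRIBUTION = [19, 17, 14, 12, 9, 5, 4, 1]

# Ex. 1: frequencies of the scores 1, 2, ..., 20 (N = 50).
EX1_FREQUENCIES = [1, 1, 0, 2, 1, 2, 1, 2, 3, 2, 2, 2, 4, 3, 6, 4, 7, 3, 2, 2]
EX1_MEASURES = [score for score, f in enumerate(EX1_FREQUENCIES, start=1)
                for _ in range(f)]

print(percentile(THIRD_DISTRIBUTION, 50))   # median of the sparse distribution
print(percentile(EX1_MEASURES, 25), percentile(EX1_MEASURES, 75))
```

Run on the third distribution of Ex. 7, the sketch interpolates between 9 and 12; run on Ex. 1 it picks out the same quartiles as the counting done by hand.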

Of the percentiles, the most useful are the 100th or 99th, 90th,
75th, 50th, 25th, 10th, 1st, or 0th. The 100th and 0th are usually
the top and bottom measures in the distribution, respectively. The
75th and 25th are the quartiles, the 50th is the median. The 90th
percentile or 9th decile is the measure which is $\frac{9(N+1)}{10}$ from the
bottom.

Determining Percentiles from Tabulated Data.—When dealing with
measures which have been grouped into classes, these various levels
are likely to fall somewhere in the middle of a class. For instance, in
Ex. 3 (p. 10), the median of 250 measures is the 125½th, which lies
somewhere within the class 98-102. Its position is found by inter-
polation, and the following formula may be used.

$$X_p = L + \frac{C}{F}\left(\frac{PN}{100} - S\right)$$

P = the percentile.
$X_p$ = the measure falling at this percentile level, which we wish to
find.
L = the lower limit of the class, somewhere within which $X_p$ lies.
S = the sum of all the frequencies up to, but not including, this
class.
F = the frequency within this class.
N = the total of all the frequencies.
C = the size of the class.


Ex. 8.—To find the median of the distribution given in Ex. 3,
substitute in this formula:
P = 50. L, the lower limit, = 98. S = 3 + 10 + 18 + 27 + 43 =
101. F = 47. N = 250. C = 5.

$$X_p = 98 + \frac{5}{47}\left(\frac{50 \times 250}{100} - 101\right) = 100.5 \text{ (to the first place of decimals)}$$

By applying the same formula, we find the 99th, 90th, 75th, 25th,
10th, and 1st percentiles to be 128, 117, 109, 94, 86, and 77 respec-
tively (omitting decimals).
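The interpolation formula lends itself to a short routine. The following Python sketch is an illustration only: the name `grouped_percentile` is hypothetical, the class lower limits are taken as the printed limits (98 for the class 98-102, as in Ex. 8), and the frequencies are those of Ex. 3 as tabulated in Ex. 12.

```python
def grouped_percentile(classes, p, class_size):
    """Percentile of grouped data by interpolation: L + (C/F)(PN/100 - S).

    classes: (lower_limit, frequency) pairs, lowest class first.
    """
    n = sum(f for _, f in classes)
    target = p * n / 100               # PN/100
    cumulative = 0                     # S, frequencies below the current class
    for lower, freq in classes:
        if freq > 0 and cumulative + freq >= target:
            return lower + class_size / freq * (target - cumulative)
        cumulative += freq
    raise ValueError("p must lie between 0 and 100")


# Distribution of Ex. 3: classes 73-77, 78-82, ..., 133-137 (N = 250).
EX3_CLASSES = [(73, 3), (78, 10), (83, 18), (88, 27), (93, 43), (98, 47),
               (103, 35), (108, 26), (113, 22), (118, 11), (123, 6),
               (128, 1), (133, 1)]

median = grouped_percentile(EX3_CLASSES, 50, 5)
others = [round(grouped_percentile(EX3_CLASSES, p, 5))
          for p in (99, 90, 75, 25, 10, 1)]
print(median, others)
```

The median comes out at about 100.55, and the other six levels, rounded to whole numbers, agree with the figures quoted in the text.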


Comparison of Median and Average.—If the distribution is a
perfectly smooth and normal one, the median is identical with the
average. In any fairly regular and symmetrical distribution they will
be quite close to one another. Thus in Ex. 8, where the median is
100.5, the average happens to be 100.68. For rough work the median
is often used as an approximation to the average, since its computa- 
tion is far easier. Sometimes, indeed, it gives a better notion of the 
‘general run’ of the measures than does the average. For example, 
the average age of a school class may be unduly raised because the 
class includes two or three pupils much older than all the rest. They 
cannot very well be omitted in calculating the average, but if the 
median is found, they no longer exert a disproportionate effect upon 
the result. Similarly, if a free word association test is applied to a 
child at a psychological clinic, most of the child’s responses may be 
given within two to three seconds, but some of them may take much 
longer, and a few words may evoke no response at all. In such a case 
the average cannot be determined, and the median is the most 
reasonable measure of the child’s speed of response. 

Here is another instance where the median may be very useful. 
Suppose an investigator wishes to find the approximate average 
speed with which a group of pupils or students can perform a certain 
task, such as handwriting. There is no need for him to time each 
person separately, nor to count the number of words each one has 
written in a certain time, and then add up and average the results. 
Instead, he may set the whole class writing a standard passage, and 
tell them to hold up their hands directly they finish. Then he may 
count the hands raised, and stop the test as soon as the median 
person raises his hand, noting the time at which this occurs. The 
median speed of writing for a class of, say, 41 persons can be calcu- 
lated from the time taken by the 21st quickest person. 

When a distribution is strongly skewed, the median differs
appreciably from the average, being larger if the skew is negative,
smaller if it is positive. In Ex. 1 (p. 14), the median is 14, the average
12.86. This difference is often used as a measure of skewness (the
formula can be found in more advanced textbooks).

Uses of Percentiles.—The percentile levels in a distribution clearly
provide us with as complete a numerical description of that distri-
bution as we may need. For example, the figures quoted in Ex. 8 for the
distribution of students’ scores on the Cattell Intelligence Test (the
99th, 90th, 75th, 50th, 25th, 10th, and 1st percentiles) give a reason-
ably full answer to the question: how did the students do on the
test? Knowing these levels, we can begin to interpret the significance
of any one person’s score, since we can tell how he stands relative to
this group of students. Thus a score of 92 on the Cattell test would
not be a very good one for a training college student. It falls at about
the 20th percentile, which means that 80% of the students did better
at the test, only 20% did worse. But when we are told that the same
score falls at the 94th percentile for normal adults (not students), we
realize that it does represent quite high ability. The applications of
percentiles to the interpretation of school marks and to test norms
will be considered later.

1 The main disadvantage of medians, and also of percentiles, is that they cannot
readily be manipulated statistically. For instance, given the medians of distribu-
tions A and B, we cannot determine from them the median of the combined
distribution, A + B; whereas the averages of A and of B will give us the average
of A + B.

Inter-quartile Range.—The range of scores from the 25th to the
75th percentiles (Q₁ to Q₃) is often called the inter-quartile range.
It tells us what the middle 50% of the group accomplished. Half of
this range is often referred to as Q, the semi-interquartile range, and
this is used as a measure of the degree of spread or dispersion of the
distribution. When the distribution is smooth and normal, Q is
usually equal to about one-eighth of the total range of measures.
In Ex. 3 (p. 10) the total range is 61, and $Q = \frac{109 - 94}{2} = 7\tfrac{1}{2}$,
which is one-eighth of 60.


CALCULATING THE AVERAGE OR MEAN 

More accurate statistical treatment of distributions demands the 
use of the average and standard deviation in place of the median 
and Q. Everyone knows how to calculate an average or arithmetic 
mean (often referred to as ‘the mean’), but most people use an
unnecessarily complex method, which is very liable to lead to mis-
takes. We will therefore outline this and the shorter alternative
methods.

(1) The ordinary method consists in adding up all the measures
and dividing by the number of cases. Expressed as an algebraic
formula, $M = \frac{\Sigma X}{N}$. M = the mean or average, X = any one meas-
ure, and Σ = the sum of all the . . . , hence ΣX signifies the sum
of all the separate measures. As a rough general rule, this method
may be used when N does not amount to more than about 20. With 
larger numbers it becomes very inefficient, unless an adding machine 
is available. 


(2) Often it is quicker to tabulate the measures, to multiply each
X by its F, and then to sum. The formula now becomes $M = \frac{\Sigma(FX)}{N}$.

Ex. 9.—The method is here applied to the distribution already
given in Ex. 1.

     X     F     FX
    20     2     40
    19     2     38
    18     3     54
    17     7    119
    16     4     64
    15     6     90
    14     3     42
    13     4     52
    12     2     24
    11     2     22
    10     2     20
     9     3     27
     8     2     16
     7     1      7
     6     2     12
     5     1      5
     4     2      8
     3     0      0
     2     1      2
     1     1      1
       N = 50   643

$$M = \frac{643}{50} = 12.86$$
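The tabulated method of Ex. 9 can be checked mechanically. The Python sketch below is illustrative only; the name `mean_from_frequencies` is hypothetical, and the dictionary simply restates the X and F columns of the table.

```python
def mean_from_frequencies(frequencies):
    """Mean of a tabulated distribution: sum of F times X, divided by N."""
    n = sum(frequencies.values())
    total = sum(x * f for x, f in frequencies.items())
    return total / n


# X -> F for the distribution of Ex. 1 (N = 50).
EX1_TABLE = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4,
             12: 2, 11: 2, 10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1,
             4: 2, 3: 0, 2: 1, 1: 1}

sum_fx = sum(x * f for x, f in EX1_TABLE.items())
mean = mean_from_frequencies(EX1_TABLE)
print(sum_fx, mean)
```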


(3) The process is shortened still further if the measures are
tabulated in classes. If x is the midpoint of each class, then
$M = \frac{\Sigma(Fx)}{N}$.

Ex. 10.—The method is here applied to the distribution given in
Ex. 2.

      X       x     F     Fx
    18-20    19     7    133
    15-17    16    17    272
    12-14    13     9    117
     9-11    10     7     70
     6-8      7     5     35
     3-5      4     3     12
     0-2      1     2      2
              N =  50
                 Σ(Fx) = 641

$$M = \frac{641}{50} = 12.82$$

Note that this value of M, 12.82, is not precisely identical with
that obtained by the previous method, 12.86, though the discrepancy
is only 0.3%. The reason is, of course, that the figures are somewhat
distorted by being grouped into classes. We are regarding the 20's
and the 18's as 19's, the 17's and 15's as 16's, and so on. In the long
run these distortions tend to cancel one another out, and therefore
have very little effect upon the final result. They only become serious
if the grouping is too coarse, i.e. if the number of classes in the table
is too small.
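The effect of grouping can be seen by letting a short program do the classing. This Python sketch is an illustration under stated assumptions: the name `grouped_mean` is hypothetical, classes are three marks wide starting at 0 (0-2, 3-5, ...), and every measure is replaced by the midpoint of its class before averaging.

```python
def grouped_mean(frequencies, class_size, start=0):
    """Mean after grouping: each measure is replaced by its class midpoint."""
    n = sum(frequencies.values())
    total = 0.0
    for x, f in frequencies.items():
        class_index = (x - start) // class_size
        midpoint = start + class_index * class_size + (class_size - 1) / 2
        total += f * midpoint
    return total / n


# X -> F for the distribution of Ex. 1 (N = 50).
EX1_TABLE = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4,
             12: 2, 11: 2, 10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1,
             4: 2, 3: 0, 2: 1, 1: 1}

exact = sum(x * f for x, f in EX1_TABLE.items()) / sum(EX1_TABLE.values())
grouped = grouped_mean(EX1_TABLE, class_size=3)
print(exact, grouped, 100 * (exact - grouped) / exact)
```

The last figure printed is the percentage discrepancy between the exact and the grouped mean, which is the 0.3% mentioned in the text.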

(4) The amount of computation, when large numbers are involved,
may be much reduced by applying the device of the arbitrary or
guessed mean. The following illustration may explain it. Suppose we
wish to find the average height of a class of children, we could first
of all measure each child's height from the ground to the top of his
head, add up the figures and divide by N. An alternative plan, which
would do equally well, would be to place each child by a 3-foot table,
and measure the height from this to the top of his head. We would
average the measures as before, but would add 36 inches to our
final result. Again, we might take a still higher table, say 50 inches,
and measure from this up to the tops of the heads of the larger
children, down to the heads of the smaller ones. If we added together
the former, subtracted the latter, divided by N, and finally added
50 inches, we should, of course, get exactly the same result as before.
But the figures with which we should be dealing would be far smaller
than in the first instance. We would be averaging, e.g. +6, +1,
−2, +3, −5, etc., instead of 56, 51, 48, 53, 45, etc.

The method consists, then, in making a rough guess at the mean,
listing the amounts by which each measure deviates from (is greater
or less than) this arbitrary mean, averaging the deviations, and
finally adding the average to the arbitrary mean.

$$M = \frac{\Sigma(FD)}{N} + A$$

A is the arbitrary or guessed mean, and D the difference between
each X and A.


Ex. 11.—The method is here applied to the distribution given in
Ex. 9 (p. 41), and it yields precisely the same result for M as was
obtained by the ordinary method of averaging. Here A = 13.

     X      D     F     FD
    20     +7     2     14
    19     +6     2     12
    18     +5     3     15
    17     +4     7     28
    16     +3     4     12
    15     +2     6     12
    14     +1     3      3
    13      0     4   + 96
    12     −1     2      2
    11     −2     2      4
    10     −3     2      6
     9     −4     3     12
     8     −5     2     10
     7     −6     1      6
     6     −7     2     14
     5     −8     1      8
     4     −9     2     18
     3    −10     0      0
     2    −11     1     11
     1    −12     1     12
                 50  − 103

$$M = \frac{+96 - 103}{50} + 13 = -0.14 + 13 = 12.86$$

Note that in this example $\frac{\Sigma(FD)}{N}$ is a minus quantity, and is
therefore subtracted from A. If we had chosen A = 12, $\frac{\Sigma(FD)}{N}$
would have been a plus quantity, namely +0.86.
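That the choice of A does not affect the result can be confirmed with a few lines of Python. The sketch is illustrative only; the name `mean_by_guessed_mean` is hypothetical, and the frequencies are those of Ex. 9.

```python
def mean_by_guessed_mean(frequencies, guessed_mean):
    """Return (average deviation from A, mean) using M = sum(F*D)/N + A."""
    n = sum(frequencies.values())
    average_deviation = sum(f * (x - guessed_mean)
                            for x, f in frequencies.items()) / n
    return average_deviation, guessed_mean + average_deviation


# X -> F for the distribution of Ex. 9 (N = 50).
EX9_TABLE = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4,
             12: 2, 11: 2, 10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1,
             4: 2, 3: 0, 2: 1, 1: 1}

correction_13, mean_13 = mean_by_guessed_mean(EX9_TABLE, 13)
correction_12, mean_12 = mean_by_guessed_mean(EX9_TABLE, 12)
print(correction_13, mean_13)
print(correction_12, mean_12)
```

With A = 13 the correction is negative, with A = 12 it is positive, and both routes lead to the same mean.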

(5) By far the most efficient method when N is large, say 100 or 
more, is to average, not the actual amounts, but the numbers of 
classes by which the tabulated measures exceed or fall short of A. 
For instance, suppose we have a table of the distribution of ages of 
all the inhabitants of a town, grouped into classes of ten years, with 
midpoints of the classes at 5, 15, 25, etc., years. Wishing to calculate 
the average age we take 35 as our arbitrary mean. Obviously there is 
no need to sum the numbers who are 10, 20, 30, . . . years more or less 
than this A. We might just as well sum those who are 1, 2, 3, . . .
decades more or less. Then, when we have calculated the average
deviation from A in decades, we must multiply it by 10 to turn it
back into years, before adding it to, or subtracting it from, A.

The formula for this method is $M = \frac{c\,\Sigma(Fd)}{N} + A$, where d is
the number of the class above or below that class of which A is the
midpoint, and c is the size of each class.


Ex. 12.—The method is here applied to the distribution given in 
Ex. 3 (p. 10). 


       X        d     F     Fd
    133-137    +6     1      6
    128-132    +5     1      5
    123-127    +4     6     24
    118-122    +3    11     33
    113-117    +2    22     44
    108-112    +1    26     26
    103-107     0    35  + 138
     98-102    −1    47     47
     93-97     −2    43     86
     88-92     −3    27     81
     83-87     −4    18     72
     78-82     −5    10     50
     73-77     −6     3     18
                    250  − 354

The arbitrary mean is taken as the midpoint of the 103-107 class,
i.e. as 105. Classes above and below this are numbered consecutively.

$$\frac{\Sigma(Fd)}{N} = \frac{+138 - 354}{250} = -0.864$$

This is the average number of classes by which M differs from A.
We must multiply by 5 to get the average deviation in scores, since
c = 5.

$$M = 105 - 5 \times 0.864 = 100.68$$
Once the student is accustomed to this method he will find it far
quicker and more accurate than any other. If in this example he had
used one of the first three methods, he would have been involved in
a sum totalling about 25,000. Points which require special caution
are: (i) making sure what measures are included in a class, and what
is the size of c; (ii) determining A correctly; (iii) multiplying $\frac{\Sigma(Fd)}{N}$
by c before adding it to A; (iv) making sure of the sign of this
quantity before adding it to, or subtracting it from, A.
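The class-step method, with the four cautions just listed, might be sketched in Python as follows. This is an illustration only: the name `mean_by_class_steps` is hypothetical, and the (d, F) pairs restate the table of Ex. 12 with A = 105 and c = 5.

```python
def mean_by_class_steps(step_frequencies, guessed_mean, class_size):
    """Mean from class steps: M = c * sum(F*d)/N + A.

    step_frequencies: (d, F) pairs, d being the number of classes by which
    a class lies above (+) or below (-) the class whose midpoint is A.
    """
    n = sum(f for _, f in step_frequencies)
    mean_step = sum(f * d for d, f in step_frequencies) / n
    # Multiply by c before adding to A, keeping the sign of the correction.
    return mean_step, guessed_mean + class_size * mean_step


# Ex. 12: classes 133-137 (d = +6) down to 73-77 (d = -6).
EX12_STEPS = [(6, 1), (5, 1), (4, 6), (3, 11), (2, 22), (1, 26), (0, 35),
              (-1, 47), (-2, 43), (-3, 27), (-4, 18), (-5, 10), (-6, 3)]

mean_step, mean = mean_by_class_steps(EX12_STEPS, guessed_mean=105, class_size=5)
print(mean_step, mean)
```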


THE MEAN VARIATION 
The Mean Variation or Average Deviation (M.V. or A.D.) is 
occasionally employed as an index of the extent to which the meas- 
ures are spread out on either side of the mean. As its name indicates, 
it is the average of all the differences between each measure and the 
mean (or sometimes the median), regardless of whether these differ- 


ences are plus or minus. 

Ex. 13.—In this very brief distribution, the mean is 5.6. The D
column lists the deviations. $M.V. = \frac{\Sigma D}{N} = \frac{10.4}{5} = 2.08$. If the data were
in the form of a frequency distribution table, the same method would
apply, but the formula would be $M.V. = \frac{\Sigma(FD)}{N}$.

     X      D
     9     3.4
     7     1.4
     6     0.4
     4     1.6
     2     3.6
    N = 5  10.4
    M = 5.6

Clearly the M.V. will be larger the more widely dispersed or
spread out the measures. Its value is a little greater than that of Q,
being generally about one-seventh of the total range of measures.
(This does not hold in Ex. 13 owing to the shortness and irregularity
of the distribution.) The M.V. happens to be of very little use for
purposes of further statistical treatment, hence it is usually passed
over in favour of the Standard Deviation.
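As a check on Ex. 13, the mean variation can be computed directly. The Python sketch below is illustrative; the name `mean_variation` is hypothetical, and deviations are taken from the mean (not the median) without regard to sign.

```python
def mean_variation(values):
    """Average of the absolute deviations of the measures from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


EX13_MEASURES = [9, 7, 6, 4, 2]

mv = mean_variation(EX13_MEASURES)
print(mv)
```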


THE STANDARD DEVIATION 
The S.D. is also a kind of average of all the deviations from the 
mean, but it differs from the M.V. in that the deviations are squared 
before they are summed, and the square root of their average is
taken.

$$\text{S.D. or } \sigma = \sqrt{\frac{\Sigma D^2}{N}}$$

Ex. 14.—The working is here shown for the same short distribu-
tion as in Ex. 13.

     X      D       D²
     9    +3.4    11.56
     7    +1.4     1.96
     6    +0.4     0.16
     4    −1.6     2.56
     2    −3.6    12.96
                  29.20
    N = 5
    M = 5.6

$$\frac{\Sigma D^2}{N} = \frac{29.20}{5} = 5.84, \qquad \sigma = \sqrt{5.84} = 2.42$$
As the mean is seldom a whole number, the squaring of deviations
from it is troublesome. Generally it is simpler to list the deviations
from some arbitrary mean which is a whole number, and then
apply a correction according to the following formula:

$$\sigma = \sqrt{\frac{\Sigma D^2}{N} - (M - A)^2}$$

Note that the correction is invariably subtracted, whether its sign
before squaring is positive or negative.

Ex. 15.—In the same distribution, let A = 6, i.e. the nearest whole
number to M.

     X     D     D²
     9    +3      9
     7    +1      1
     6     0      0
     4    −2      4
     2    −4     16
                 30

$$\frac{\Sigma D^2}{N} = \frac{30}{5} = 6.0, \qquad (M - A)^2 = (5.6 - 6)^2 = 0.16$$

$$\sigma^2 = 6.0 - 0.16 = 5.84, \qquad \sigma = 2.42$$


The same formula may be extended to several different types of
distributions.

(a) When the distribution is a short one, and all the measures are
whole numbers, A may be chosen at zero. The D’s will then be the
original X's, so that:

$$\sigma = \sqrt{\frac{\Sigma X^2}{N} - M^2}$$

Ex. 16.—

     X     X²
     9     81
     7     49
     6     36
     4     16
     2      4
          186

$$\frac{\Sigma X^2}{N} = \frac{186}{5} = 37.2, \qquad M^2 = 5.6^2 = 31.36$$

$$\sigma^2 = 37.2 - 31.36 = 5.84, \qquad \sigma = 2.42$$

This method is normally applied when a calculating machine is
available, since the machine is not, like the human mind, liable to
errors when coping with large numbers.
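The agreement between the direct method of Ex. 14 and the arbitrary-mean versions of Ex. 15 and Ex. 16 can be verified with a short Python sketch. The function names are hypothetical; the correction applied is the one stated above, the mean square deviation from A less (M − A)².

```python
import math


def sd_direct(values):
    """Standard deviation from deviations about the true mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)


def sd_from_arbitrary_mean(values, a):
    """Standard deviation from deviations about an arbitrary mean A,
    corrected by subtracting (M - A) squared."""
    n = len(values)
    mean = sum(values) / n
    mean_square = sum((x - a) ** 2 for x in values) / n
    return math.sqrt(mean_square - (mean - a) ** 2)


MEASURES = [9, 7, 6, 4, 2]

direct = sd_direct(MEASURES)                      # Ex. 14
about_six = sd_from_arbitrary_mean(MEASURES, 6)   # Ex. 15, A = 6
about_zero = sd_from_arbitrary_mean(MEASURES, 0)  # Ex. 16, A = 0
print(direct, about_six, about_zero)
```

All three values agree, apart from rounding in the last decimal places, at the square root of 5.84.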

(b) Often the mean is not known, but is calculated at the same
time as the S.D. Now $M = \frac{\Sigma D}{N} + A$. Hence an alternative version
of the same formula is:

$$\sigma = \sqrt{\frac{\Sigma D^2}{N} - \left(\frac{\Sigma D}{N}\right)^2}$$


(c) When the measures are grouped into classes, the formula
becomes:

$$\sigma = c\,\sqrt{\frac{\Sigma(Fd^2)}{N} - \left(\frac{\Sigma(Fd)}{N}\right)^2}$$

Here c is the size of each class, and d the deviation of each measure
from an arbitrary or guessed mean in terms of number of classes.

Ex. 17.—The formula is here applied to the distribution whose
mean was calculated in Ex. 12. It was found there that $\frac{\Sigma(Fd)}{N}$ =
−0.864. Hence the correction $\left(\frac{\Sigma(Fd)}{N}\right)^2$ = 0.746.

       X        F     d     d²     Fd²
    133-137     1    +6     36      36
    128-132     1    +5     25      25
    123-127     6    +4     16      96
    118-122    11    +3      9      99
    113-117    22    +2      4      88
    108-112    26    +1      1      26
    103-107    35     0      0       0
     98-102    47    −1      1      47
     93-97     43    −2      4     172
     88-92     27    −3      9     243
     83-87     18    −4     16     288
     78-82     10    −5     25     250
     73-77      3    −6     36     108
              250                1,478

$$\frac{\Sigma(Fd^2)}{N} = \frac{1478}{250} = 5.912$$

$$\sigma = 5\sqrt{5.912 - 0.746} = 11.36$$

One more instance will be given, to illustrate the calculation of the
mean and S.D. in one operation.


Ex. 18.—The following distribution of school marks is grouped
into classes of two. Take A at the 11-12 class, i.e. at 11.5.

      X       F     d     Fd     Fd²
    19-20     2    +4      8      32
    17-18     5    +3     15      45
    15-16     4    +2      8      16
    13-14     4    +1      4       4
    11-12     3     0   + 35       0
     9-10     7    −1      7       7
     7-8      4    −2      8      16
     5-6      4    −3     12      36
     3-4      2    −4      8      32
     1-2      1    −5      5      25
    N =      36         − 40     213

$$\frac{\Sigma(Fd)}{N} = \frac{-5}{36} = -0.139$$

This is in terms of classes. In terms of marks,

$$\frac{c\,\Sigma(Fd)}{N} = 2 \times (-0.139) = -0.278$$

Thus M = 11.5 − 0.278 = 11.22.

$$\frac{\Sigma(Fd^2)}{N} = \frac{213}{36} = 5.917, \qquad \left(\frac{\Sigma(Fd)}{N}\right)^2 = 0.019$$

$$\sigma = 2\sqrt{5.917 - 0.019} = 4.858$$


These calculations are quite quickly and easily accomplished with 
a little practice, though some care is needed in checking the arith- 
metic. For example, the whole of the above calculation, using 4- 
figure logarithms and checking by slide-rule, occupied the writer less 
than four minutes. 
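For readers with a computer rather than logarithm tables to hand, the one-operation calculation of the mean and S.D. from grouped data might be sketched in Python as below. The name `grouped_mean_and_sd` is hypothetical; the (d, F) pairs restate the tables of Ex. 18 and Ex. 17.

```python
import math


def grouped_mean_and_sd(step_frequencies, guessed_mean, class_size):
    """Mean and S.D. of grouped data from class steps d about A.

    M     = A + c * sum(F*d)/N
    sigma = c * sqrt(sum(F*d^2)/N - (sum(F*d)/N)^2)
    """
    n = sum(f for _, f in step_frequencies)
    mean_step = sum(f * d for d, f in step_frequencies) / n
    mean_square_step = sum(f * d * d for d, f in step_frequencies) / n
    mean = guessed_mean + class_size * mean_step
    sd = class_size * math.sqrt(mean_square_step - mean_step ** 2)
    return mean, sd


# Ex. 18: classes of two marks, A = 11.5.
EX18_STEPS = [(4, 2), (3, 5), (2, 4), (1, 4), (0, 3),
              (-1, 7), (-2, 4), (-3, 4), (-4, 2), (-5, 1)]
# Ex. 17: classes of five scores, A = 105.
EX17_STEPS = [(6, 1), (5, 1), (4, 6), (3, 11), (2, 22), (1, 26), (0, 35),
              (-1, 47), (-2, 43), (-3, 27), (-4, 18), (-5, 10), (-6, 3)]

mean_18, sd_18 = grouped_mean_and_sd(EX18_STEPS, 11.5, 2)
mean_17, sd_17 = grouped_mean_and_sd(EX17_STEPS, 105, 5)
print(round(mean_18, 2), round(sd_18, 2))
print(round(mean_17, 2), round(sd_17, 2))
```

Carried to more decimal places than four-figure logarithms allow, the S.D. for Ex. 18 comes out at about 4.86, in agreement with the hand calculation to two decimal places.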

The S.D. is always greater than Q or the M.V. This is to be expec- 
ted, since any big deviations from the mean receive much greater 
weight in the S.D. than in the M.V. by being squared before they are 
averaged. It generally amounts to one-fifth or one-sixth of the total 
range, or sometimes more if the distribution is irregular, as in Ex. 18. 
This rule enables us to apply a rough check to our result. If in Ex. 18 
we had obtained σ = 2 or σ = 10, we should have known that there
had been some mistake in the computation.

The Significance of the S.D.—It was shown earlier in this chapter
that any distribution of scores could be fairly completely defined or
numerically described provided we knew some of the main percentile
levels. Now if a distribution approximates to the normal shape (as it
should do when obtained from a well-constructed test or examination),
it can be more simply specified merely by finding the mean score and
the standard deviation. The reason for this is that the normal curve
is a regular shape which, like the straight line or the parabola, has
its own algebraic equation. For instance: y = kx gives a straight line,
running through the origin, whose slope depends on the size of k. A
normal curve’s equation is more complicated, but actually involves
nothing more than x, y, σ and certain constants:

$$y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2 / 2\sigma^2}$$

x refers to any given score expressed as a deviation from the mean, and
y is the frequency of that score, expressed as a proportion of the total
number of cases.

This means that, if we know the S.D. of a distribution, we can 
deduce y for every value of x, that is the frequency of every possible 
measure. For instance, although we cannot measure the intelligence 
quotients of the whole population of Great Britain, yet, if we may 
assume that the I.Q. is a variable which is normally distributed, with 
a mean of 100 and a S.D. of 15, we can state at once just what
proportions of the population have I.Q.s of 100, 101, . . . etc., or what
proportion lies below 67 I.Q., or between 120 and 130 I.Q., and so on.

Furthermore, there are available in most statistical textbooks tables
which will tell us the value of y corresponding to each value of x/σ,
regardless of what the variable may be (I.Q., height, handwriting speed,
etc.). These are known as ‘probability integral’ tables. The graph given
in Fig. 18 is based on such tables, but is probably easier to use, and is
accurate to two or three significant places. Against x/σ is plotted, not
y (the actual proportion of measures which deviate from the mean by this
amount) but the proportion which deviate by this amount or more. Consider
for example I.Q. 115. Here x is +15, and if the S.D. is 15, then
x/σ = 1. That is, 115 represents a deviation of 1 × σ above the mean.
Now the lowest curve in Fig. 18 tells us that when x/σ = 1, y = 0.16. In
other words, about 0.16 or 16% of the population would be expected to
have I.Q.s of 115 upwards. Similarly about 16% have I.Q.s of 85 and
below, since 85 represents a deviation of −1 × σ from the mean. If we
enquire how many I.Q.s there are of 130 upwards, x/σ = +2, and the second
curve gives the answer 0.023, or about 2.3%. Note that the same curves
can be used for deductions about any other normally distributed variable,
provided allowance is made for the size of the S.D. The upper curves are
for dealing more accurately with larger values of x/σ and smaller values
of y. Each curve is, as it were, a 10 times magnification of a section of
the curve below.
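
Where the graph is not to hand, the same proportions can be computed directly. The following is a minimal sketch using Python's standard `statistics.NormalDist`; the function name is our own, and the mean of 100 and S.D. of 15 are the values assumed in the text.

```python
from statistics import NormalDist

def proportion_at_or_above(score, mean, sd):
    """Proportion of a normal distribution lying at or beyond `score`."""
    return 1.0 - NormalDist(mean, sd).cdf(score)

for iq in (115, 130):
    print(iq, round(proportion_at_or_above(iq, mean=100, sd=15), 3))
```

The two figures should agree with those read from Fig. 18 to two significant places.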


Ex. 19.—In a certain area, with a school population of 25,000, an
intelligence test with a S.D. of 16½ was applied. How many children
may be expected to have I.Q.s of 120 to 130? And what will be the
highest I.Q. among the least intelligent 1,000 of these children?

Intelligence is something which varies continuously; that is to say,
every possible value of I.Q. in between 119 and 120 can occur,
although we usually calculate it only to the nearest whole number.
Strictly speaking, then, we want to know how many children have
I.Q.s between 119.50 and 130.50, and to exclude those whose I.Q.s
are 119.49 or less (who would count as 119's), also those whose
I.Q.s are 130.51 or more (who would count as 131's). Now 119.5
and 130.5 differ from the mean by 19.5 and 30.5, that is by 1.18σ and
1.85σ, taking σ as 16½. Reading off from Fig. 18, the corresponding
proportions are 0.119 and 0.0322. Hence the proportion of the
population within the given limits is 0.119 − 0.0322 = 0.0868 =
8.68%. 8.68% of 25,000 is 2,170, the answer to our first question.

In the second question, 1,000 out of 25,000 is a proportion of
0.040. From Fig. 18, 0.04 corresponds to an x/σ value of 1.75.
1.75 × 16½ = 28.875. The lowest thousand therefore have I.Q.s of
100 − 28.875, or less. Hence the answer, to the nearest whole
number of I.Q., is 71.
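
Ex. 19 can be checked along the following lines. This is a sketch only: the variable names are ours, and the mean of 100 is assumed as in the text. Because the computation does not round x/σ to two places or rely on reading a graph, the first answer may differ from 2,170 by a few children.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16.5)   # mean of 100 is assumed, as in the text
population = 25_000

# Children recorded as I.Q. 120 to 130, i.e. true values from 119.5 to 130.5.
proportion = iq.cdf(130.5) - iq.cdf(119.5)
print(round(proportion * population))

# Highest I.Q. among the lowest 1,000 children (a proportion of 0.04).
cutoff = iq.inv_cdf(1_000 / population)
print(round(cutoff))
```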

Ex. 20.—Does the distribution of intelligence test scores given in
Ex. 17 (p. 47) conform to a normal distribution? In other words, are
the obtained frequencies similar to the frequencies which would be
expected for a truly normal distribution with the same mean (100.68)
and the same S.D. (11.36)?

These test scores, unlike I.Q.s, are discrete. Nobody could score
107½, however carefully his score was computed. If he nearly, but
not quite, completed 108 items, his score would still be 107. Thus
in order to find what proportion may be expected (according to the
normal curve) to fall within the 103-107 class, we must determine
the proportion with scores of 103 or more and subtract from it the
proportion with scores of 108 or more, not the proportions lying
between 102.5 and 107.5. The calculations for all the classes are shown
in the following table. For example, 103 is a deviation of +2.32
from the mean, or +0.20σ; 108 deviates +7.32 or +0.65σ from
the mean. From Fig. 18 we find the proportions lying at or beyond
these two limits to be 0.421 and 0.258 respectively. Hence the
proportion to be expected within the 103-107 class is 0.421 − 0.258 =
0.163. N is 250, and 0.163 × 250 = 41. The penultimate column
lists the expected frequencies (omitting decimals), and the last
column gives the frequencies actually obtained. The two columns
agree fairly closely, but there are some discrepancies. At present we
lack a method for telling whether these discrepancies should be
regarded as appreciable or important, but such a method will be
described later (pp. 90-1). The full answer to this question will
therefore be postponed.
                              Proportions     Proportions
                              beyond these    within each    F ex-    F ob-
  X        x         x/σ      values of x/σ   class          pected   tained
133+    +32.32     +2.85        0.00219        0.00219          1        1
128+    +27.32     +2.41        0.0078         0.00561          1        1
123+    +22.32     +1.97        0.025          0.0172           4        ?
118+    +17.32     +1.53        0.062          0.037            9       11
113+    +12.32     +1.09        0.138          0.076           19        ?
108+     +7.32     +0.65        0.258          0.120           30       26
103+     +2.32     +0.20        0.421          0.163           41       35
 98+     -2.68     -0.24        0.404          0.175           44       47
 93+     -7.68     -0.68        0.246          0.158           40       43
 88+    -12.68     -1.12        0.133          0.113           28       27
 83+    -17.68     -1.56        0.059          0.074           18       18
 78+    -22.68     -2.00        0.023          0.036            9       10
 73+    -27.68     -2.44        0.0073         0.0157           4        ?
 68+    -32.68     -2.88        0.00195        0.0073           2        0

                                               1.00000        250      250
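
The expected frequencies in the table can be approximated as follows. This is a hedged sketch, not the procedure by which the table was compiled: the function name is ours, and, as the table's own totals suggest, we assume that the lowest class absorbs everything below 73 and the highest everything from 133 upwards. Since x/σ is not rounded to two places here, individual classes may differ from the graph readings by a unit or so.

```python
from statistics import NormalDist

MEAN, SD, N = 100.68, 11.36, 250
LOWER_LIMITS = list(range(68, 134, 5))   # 68, 73, ..., 133

def expected_frequencies(mean, sd, n, lower_limits):
    """Expected frequency in each class, keyed by the class's lower limit."""
    dist = NormalDist(mean, sd)
    # Proportion with scores at or above each lower limit.
    at_or_above = [1.0 - dist.cdf(limit) for limit in lower_limits]
    last = len(lower_limits) - 1
    result = {}
    for i, limit in enumerate(lower_limits):
        if i == 0:
            p = 1.0 - at_or_above[1]          # lowest class takes the lower tail
        elif i == last:
            p = at_or_above[i]                # highest class takes the upper tail
        else:
            p = at_or_above[i] - at_or_above[i + 1]
        result[limit] = n * p
    return result

expected = expected_frequencies(MEAN, SD, N, LOWER_LIMITS)
for limit in reversed(LOWER_LIMITS):
    print(f"{limit}+  {expected[limit]:6.1f}")
```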


On examining Fig. 18, we see that the proportions whose scores
deviate by 2.5σ or more are very small, less than 1 in 100, and that
those who deviate by 3σ or more only number about 1 in 1,000.
Here, then, we have the reason for a statement made earlier, namely
that the S.D. usually amounts to one-fifth or one-sixth of the total
range of a distribution. For when N is about 100, and the distribution
is a regular one, we shall ordinarily find the measures ranging from
+2.5σ to −2.5σ, i.e. a total of 5σ. And when N approaches 1,000
we may find scores running from +3σ to −3σ, a total of 6σ.

There is yet another way of interpreting normal distributions,
which will be useful to us later, that is in terms of probabilities.
Since 0.159 of persons may be expected to obtain measures 1σ or
more above the mean, the probability that any one person will
measure this much or more is 0.159. The probability that he will
measure less is 0.841. In other words, the odds are 841 to 159, or
about 5¼ to 1 against his measure being as high as or higher than
this. Similarly the probability of a measure as high as +3σ or more
is 0.00135, and the chances against it are 740 to 1. This means, for
example, that we should probably find only one child in an ordinary
elementary school of about 740 pupils with an I.Q. of 145 or more
if we assume the I.Q.s of school children to be normally distributed
with a σ of 15, since 145 = 100 + 3 × 15. But as many tests have
considerably larger S.D.s, they will yield a greater number of high
I.Q.s, and the probability of coming across such children will be
increased. On the other hand the range of intelligence in any one
primary school is often limited, and this would make the odds
against high I.Q.s much stronger. Thus if σ is 12½, then 150 is 4σ
above the mean, and we can deduce from Fig. 18 that only about 1
in 31,000 (in schools with an average I.Q. 100) would reach this level.
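
The odds quoted above follow directly from the tail proportions, as this brief sketch suggests. The function name is our own choice.

```python
from statistics import NormalDist

def odds_against(z):
    """Odds against a measure lying z standard deviations or more above the mean."""
    p = 1.0 - NormalDist().cdf(z)
    return (1.0 - p) / p

for z in (1, 3, 4):
    print(z, round(odds_against(z), 1))
```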


CHAPTER IV 


INTERPRETATION OF SCHOOL MARKS AND 
TEST SCORES 


Combining or Comparing Distributions of Marks.—In Chap. II we 
found that the marks awarded to school work or examinations are 
often so erratically distributed that it is impossible to interpret them, 
as they stand, correctly. The first stages in the interpretation of 
marks involve combining and comparing them. For instance, the 
marks for different questions in one examination, or different exami- 
nations in one subject, need to be combined to give a total score. 
Often the marks on several subjects are combined to yield a general 
view of the pupil’s or student’s all-round ability. Comparison enters 
when we wish to know whether a pupil or student is better on one 
subject than on some other subject (subjects which may have been 
marked either by the same, or by different, examiners or teachers). 
In large-scale examinations where several examiners each deal with a 
proportion of the scripts it is, of course, essential that their markings 
should be comparable. Sometimes, also, age allowances must be 
made; the marks of younger and older pupils need to be rendered 
comparable by the elimination of differences due to age. 

Now, none of these processes can be carried out directly unless 
the marks to be combined or compared occur in closely similar 
frequency distributions. This is a fundamental statistical principle, 
very elementary in nature, yet very widely neglected. But with the 
aid of the statistical tools described in Chap. III, we are now in a 
position to determine whether such distributions are sufficiently 
similar, and if they are not, to adjust them in such a way as to make 
them so. These tools will further assist in interpretation, since they 
provide a scientific definition of those exceedingly vague conceptions 
—the goodness or badness of a mark or test score, and the ease or 
difficulty of an examination question or test item.

Distributions may be dissimilar from one another in three respects : 
(1) differences in their averages arise, for example, when one
examiner is far more lenient than another; (2) differences in their
dispersions may arise when one examiner spreads out his marks far more
widely above and below the average than another; (3) differences
occur in their shapes, e.g. normal, skew, rectangular, etc. For the
time being we will consider only the most important case, the
approximately normal. We may then state that marks can be legitimately
combined or compared only if drawn from distributions with the
same mean and the same S.D.

The Importance of Equivalent Standard Deviations.—Identity of
S.D.s is the more important of these two requirements, though it is
less often recognized. In many instances of combination of marks,
the averages do not matter in the slightest.


Ex. 21.—Suppose that the final marks of a group of pupils or
students depend on their results in three subjects, A, B, and C, and
that the distributions are as follows:

Subject    Mean    Range     S.D.
A           60     40-80      7
B           50     30-70      7
C           80     60-100     7

Now, a pupil who is average on all three subjects obtains a total
of 190/300, and another who is average on two subjects but top on
the third scores 210/300 whatever the subject.


Ex. 22.—In contrast to Ex. 21, let us now combine the following
sets of marks, whose S.D.s differ.

Subject    Mean    Range     S.D.
A           60     40-80      7
B           50     10-90     14
C           80     70-90      3½

A pupil who is average all round again scores 190. But now a
pupil who is average in two subjects and top in the third scores 210,
230, and 200 according as the third subject is A, B, or C respectively.
The fact of doing well i 
actually raises the sco 
And doing well in C has the least effect in raising the final total 
(although it has the hi 
Similarly if we compare the performances of three pupils who are
at the lower quartile г 
we find their respecti 
on B has a big effec 

The general princi 
Several sets of mark:
weight to written as to mental arithmetic, we shall accomplish it,
not by making the average mark on the former twice the average on 
the latter, but by making the range and S.D. of the former twice as 
large. The usual procedure in such weighting is to multiply the for- 
mer set of marks by two. This may produce the desired effect, not 
because it doubles the average, but because it doubles the S.D. The 
same effect would follow if the average remained unchanged while 
the deviation of each pupil's mark from this average was doubled. 
When the marks derived from several questions in a single examina- 
tion are to be combined in certain proportions, this same principle of 
adjusting the S.D.s for the questions must be borne in mind. 
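
The contrast between Ex. 21 and Ex. 22 can be made concrete with a few lines of Python. This is an illustrative sketch: the function name is ours, and the top mark in each subject is taken to be the upper end of the range given in the tables.

```python
# (mean, top mark) for each subject, taken from the tables of Ex. 21 and Ex. 22.
EX21 = {"A": (60, 80), "B": (50, 70), "C": (80, 100)}
EX22 = {"A": (60, 80), "B": (50, 90), "C": (80, 90)}

def total_when_top_in(subjects, top_subject):
    """Total mark of a pupil who is average everywhere except `top_subject`."""
    return sum(top if name == top_subject else mean
               for name, (mean, top) in subjects.items())

for label, subjects in (("Ex. 21", EX21), ("Ex. 22", EX22)):
    print(label, {s: total_when_top_in(subjects, s) for s in subjects})
```

With equal S.D.s every subject raises the total by the same amount; with unequal S.D.s the subject with the widest spread carries the most weight, whatever its mean.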

In order to combine or compare two sets of marks whose distribu- 
tions are dissimilar, we should then express each set in the form of 
deviations from their own means, and multiply one set by a figure 
which will raise or lower its S.D. to the same level as that of the other 
set. For instance, in the two sets B and C in Ex. 22, the ratio of S.D.s
is 14 to 3½, or 4 to 1. The deviations in C will therefore become
equivalent to those in B if multiplied by 4. Alternatively, each mark
may be expressed in x/σ form. The top marks on A, B, and C thus become
20/7, 40/14, and 10/3½, all of which equal +2.86.
When so converted, they are all equivalent.

Scores which are expressed in this form are referred to as standard 
scores, or σ scores, or sometimes z-scores. As we shall see below,
they are frequently adopted in scoring mental tests. For example, the 
vocational tester may wish to know whether a man’s performance 
on a mechanical test is relatively higher or lower than on a clerical or 
some other test. The tests are likely to have different means and S.D.s, 
but the man’s standard scores can legitimately be compared. 
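
A standard score is simple to compute, as the following sketch shows. The function name is our own, and the figures are the top marks, means, and S.D.s of Ex. 22.

```python
def standard_score(mark, mean, sd):
    """Deviation from the mean expressed in S.D. units (x/σ)."""
    return (mark - mean) / sd

# (top mark, mean, S.D.) for subjects A, B, and C of Ex. 22
ex22 = {"A": (80, 60, 7.0), "B": (90, 50, 14.0), "C": (90, 80, 3.5)}
z = {name: standard_score(*row) for name, row in ex22.items()}
print({name: round(value, 2) for name, value in z.items()})
```

All three top marks reduce to the same standard score, which is why they may be treated as equivalent.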

The Importance of Equivalent Means.—Naturally the averages of 
distributions are important under some circumstances. If marks on 
several questions or several subjects are to be combined, and alterna- 
tives are allowed, so that each pupil's final mark is derived from a 
selected number of questions or subjects, then it is quite essential to 
ensure equivalent means. Thus in Exs. 21 and 22, if pupils took either 
A and B, or A and C, those who were at the average level would total
110 on the former combination, 140 on the latter. Again, when scores 
are to be, not combined, but compared, it is obvious that differences 
in means will entirely upset the comparisons. The adjustment of
discrepancies in means is easy: merely add or subtract so much from
all the marks in one subject or question until its average is the same
as that for the other subject or question. More often than not,
however, discrepancies in means are accompanied by discrepancies in
S.D.s. In that case resort must be made to the technique already
described, according to which each mark is expressed as x/σ, or to some
modification of this technique (cf. pp. 59-62).


EQUATING OF MARKS BY THE PERCENTILE
TECHNIQUE


Although the statistical treatment of distributions in terms of 
means and S.D.s is far from difficult, it may admittedly lie outside 
the scope of the ordinary teacher or examiner. Further, it cannot be
used when the distributions are clearly non-normal. For instance, if a
headmaster, to whom were submitted the marks listed in Ex. 5 (p. 29),
wished to compare the pupils' standings on reading and geography,
he could not use this technique. The alternative, percentile, system
would, however, be quite suitable for making such a comparison,
and it is probably simpler for the uninitiated to comprehend. (It does
not even involve looking up squares and square roots, as does a
technique based on the S.D.)

Percentiles will provide a uniform scale in terms of which
comparisons may be made between any sets of marks. Whatever the test or
examination which a group of pupils or students takes, a certain
percentile level always means the same level of achievement (relative
to that group of people). For instance, on looking at Ex. 5 we might
be inclined to suppose that 8 marks for reading represents a better
achievement than 3 marks for geography; certainly the pupils
themselves would infer that this was so. But actually they are precisely
equivalent, since both marks fall at the 25th percentile, Q₁. The
teacher may, of course, assert that the general standard of reading in
that class is good and the geography poor. This is an objection which
will frequently be raised by critics of this chapter, and we will show
later why we believe it to be baseless.
Percentile Graphs.—One of the best ways of comparing two sets of
marks is by a 7-point percentile graph, which is illustrated in the
following example.

               Frequencies
Mark       English    Arithmetic
90 +           1           5
80 +          10          11
70 +          15          13
60 +          17          15
50 +           8          10
40 +           5           2
30 +           1           0
20 +           0           1

              57          57

Ex. 23.—The table shows the marks of 57 pupils, aged 12+, in
a Scottish school on final examinations in English and arithmetic.
Though the tabulation is coarse, it indicates that the two
distributions differ in their means and S.D.s, and that, while the first is
roughly normal, the second is negatively skewed. The procedure is,
first, to determine the 99th, 90th, 75th, 50th, 25th, 10th, and 1st
percentiles for each set of marks. From the original lists these are
found to be 90, 84, 77, 67, 59, 49, 39, and 97, 89, 81, 71, 60, 53, 35,
respectively. Draw a graph, as shown in Fig. 19, where English
marks are plotted along the Y axis against arithmetic marks on the
X axis. Plot these seven points (X = 97, Y = 90; X = 89, Y = 84;
etc.) and connect them up by a straight line or smooth curve which
runs as nearly as possible through them all. This line may be
extended at both ends so as to cover the total ranges of marks. From
it we can now read off the mark on the English scale which
corresponds to each mark on the arithmetic scale, and vice versa; e.g.
41 in English is equivalent to 39 in arithmetic, 80 in English to 85
in arithmetic.

Fig. 19.—Comparison of two sets of school marks by means of a seven-point
graph.


Often it is not necessary to determine as many points as this in
order to fix the graph. Three points, preferably the 90th, 50th, and
10th percentiles, may give sufficient accuracy. Note that just the
same procedure could have been used even if the distributions had
differed widely in range and dispersion, e.g. if one subject had been
marked out of 40, the other out of 200.*
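
The reading of equivalent marks from a percentile graph can be imitated numerically. The sketch below is an approximation under a stated assumption: it joins successive percentile points of Ex. 23 by straight segments, whereas the text recommends a single straight line or smooth curve drawn by eye, so the results agree only to within about a mark. The function name is ours.

```python
# Seven percentile points of Ex. 23: 1st, 10th, 25th, 50th, 75th, 90th, 99th.
ARITHMETIC = [35, 53, 60, 71, 81, 89, 97]
ENGLISH = [39, 49, 59, 67, 77, 84, 90]

def equate(mark, from_points, to_points):
    """Read the equivalent mark off straight lines joining successive points."""
    # Find the segment containing the mark; the end segments are extended
    # for marks lying outside the plotted range.
    i = 0
    while i < len(from_points) - 2 and mark > from_points[i + 1]:
        i += 1
    x0, x1 = from_points[i], from_points[i + 1]
    y0, y1 = to_points[i], to_points[i + 1]
    return y0 + (mark - x0) * (y1 - y0) / (x1 - x0)

print(round(equate(39, ARITHMETIC, ENGLISH), 1))   # arithmetic 39 on the English scale
print(round(equate(85, ARITHMETIC, ENGLISH), 1))   # arithmetic 85 on the English scale
```

Swapping the last two arguments converts in the opposite direction, from English to arithmetic.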

Interpretation of Marks by Means of Percentile Levels.—The
results of many standardized tests of intelligence and vocational
aptitudes are quoted in terms of percentiles, since the original
raw scores have little meaning in themselves; for example, percentiles
relative to 13-year modern pupils, or percentiles relative to R.A.F.
recruits, and so on. It has sometimes been suggested that the same
system should be used for school or university examinations. The
original marks would not be published at all. Average achievement
would always be represented by 50, and very high or low achievement
by marks approaching 100 and 0. Though this would be fairer in some
ways, it would be difficult for pupils, students, or parents to grasp
that these were always relative to the group which had taken that
examination. Percentiles are, in fact, precisely equivalent to rank
orders, except that they always run from 100 to 0 instead of from 1
to N. The scheme would also not be feasible for examinations taken
by small numbers, say less than twenty students (cf. p. 66).
Nevertheless it would be a considerable step forward if the mark lists for
large-scale examinations, at universities, training centres or
secondary schools, included a table of the median, quartiles and some of
the other percentiles for the marks which had actually been awarded.
Teachers and students would soon learn the significance of these
terms, and would then be able to see what standards of marking the
various examiners were adopting. They could also obtain a much
better idea of relative marks in different subjects than from the
present supposedly absolute marks, which actually vary widely from
one examination to another.


* Further discussion of this topic may be found in McIntosh et al. (1949).

Defects of Percentiles.—Although percentiles are invaluable for
purposes of comparing distributions, they may be very misleading if
used for combining distributions. The reason is that percentile units
are very far from equivalent to one another, since they are far closer
together at or near the median than they are at or near the extremes.
It can be seen from the ‘probability integral’ graph (Fig. 18) that a
percentile level of 90, i.e. the measure above which only 10% of the
measures lie, corresponds to an x/σ or standard score of +1.28. The
80th similarly corresponds to +0.84, the 60th and 50th to +0.25
and 0.00. Thus the difference between the first two of these, 0.44, is
decidedly greater than that between the second pair, 0.25. Actually
the 99th, 94th, 78th, 50th, 22nd, 6th, and 1st percentiles are all
approximately equally spaced, in the sense that a child who is at the 99th
percentile is as superior to one at the 94th, as the 94th is to the 78th,
or the 78th to the 50th. The consequence of this inequality of units
is that, although we may, if we wish, average or otherwise combine
a person's percentiles on different tests, we shall obtain thereby quite
different results from those obtained by averaging his scores.

Ex. 24.—Pupil A's marks on two examinations lie at the 98th and
54th percentiles; Pupil B's marks on both examinations lie at the
85th percentile. A's average percentile of 76 is distinctly poorer than
B's 85. But if these figures are converted into x/σ scores, i.e. into units
which really are equivalent, it will be found that A is actually slightly
better than B. For the x/σ values of the 98th and 54th percentiles are
+2.054 and +0.100, average +1.077; whereas the x/σ value of the
85th percentile is only +1.036.

It is advisable, therefore, never to use percentiles for combining
marks. The same is true of rank orders (cf. pp. 25-6).
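
Ex. 24 may be verified with the inverse of the normal probability integral, as in this short sketch. The function name is our own.

```python
from statistics import NormalDist

def z_from_percentile(percentile):
    """x/σ value below which `percentile` per cent of a normal distribution lies."""
    return NormalDist().inv_cdf(percentile / 100)

pupil_a = [98, 54]
pupil_b = [85, 85]

def mean_percentile(pcts):
    return sum(pcts) / len(pcts)

def mean_z(pcts):
    return sum(z_from_percentile(p) for p in pcts) / len(pcts)

for name, pcts in (("A", pupil_a), ("B", pupil_b)):
    print(name, mean_percentile(pcts), round(mean_z(pcts), 3))
```

Pupil A has the lower average percentile but the higher average standard score, which is the reversal the example describes.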

Adoption of a Standard Scale for All School or Examination Marks. 
— The following solution to this difficulty may be recommended. 
Each chief examiner, headmaster, or other authority who is con- 
cerned with combining, comparing, and interpreting sets of marks 
awarded by sub-examiners, class teachers, etc., should adopt an 
arbitrary standard scale. He might, for instance, choose a percen- 
tage scale ranging roughly from 30% to 90%, with a mean of 60% 
and a S.D. of 10. Now, on such a scale the 7 percentile points
mentioned above fall at 83, 73, 67, 60, 53, 47, and 37 respectively.* Each
set of marks submitted by a teacher or examiner should be converted
into this scale by the graphical method already described. If this is
done, all the marks from all the markers will be strictly comparable
with one another, and they will be expressed in units which are
(sufficiently closely for all practical purposes) equivalent to one
another, so that they can be combined or averaged at will. And there is
the further advantage that the scale corresponds well to the present
conventional conception of high and low percentages. Naturally any
other standard scale can be chosen if preferred, and its percentile
points determined from the probability integral graph. Fig. 20 shows
the graphs for transforming the two distributions of Ex. 23 into the
suggested scale. 97 for arithmetic and 90 for English are plotted
against 83 (the 99th percentile point) on the standard scale, and so
on. The standard mark for any other mark can be read off. Thus 81%
in arithmetic becomes 67% standard, and 81% in English becomes
71% standard.

* The x/σ values of the 99th and 1st percentiles are + and − 2.326 respectively,
of the 90th and 10th they are + and − 1.282, of the 75th and 25th they are +
and − 0.675.

Fig. 20.—Conversion of two sets of school marks into a standard scale.
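
The seven percentile points of any chosen standard scale can be computed rather than read from the graph. The following is a minimal sketch for the scale suggested here (mean 60, S.D. 10); the variable names are ours, and the arithmetic percentile marks are those of Ex. 23.

```python
from statistics import NormalDist

SCALE = NormalDist(mu=60, sigma=10)       # the suggested standard scale
PERCENTILES = [99, 90, 75, 50, 25, 10, 1]

standard_points = [round(SCALE.inv_cdf(p / 100)) for p in PERCENTILES]
print(standard_points)

# Pair the arithmetic percentile marks of Ex. 23 with their standard marks.
arithmetic_marks = [97, 89, 81, 71, 60, 53, 35]
arithmetic_to_standard = dict(zip(arithmetic_marks, standard_points))
print(arithmetic_to_standard[81])
```

Marks falling between the seven points would still be converted by the graph, or by interpolation between neighbouring points.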

Another way of achieving the same result is as follows. By apply- 
ing the method described in Ex. 20 (p. 50), a table may be drawn up 
showing the numbers of cases to be expected within each class- 
interval on the desired standard scale. Thus for a scale with a mean 
of 60 and S.D. 10, the following percentages of pupils should fall 
within successive class-intervals of 5 marks. 


Ex. 25.—

Mark       F%
88-92      0 or ½
83-87      1
78-82      3
73-77      6½
68-72     12
63-67     17
58-62     20
53-57     17
48-52     12
43-47      6½
38-42      3
33-37      1
28-32      0 or ½

         100

Teachers or sub-examiners may be given such a table and
instructed to award marks to their pupils roughly in accordance with
the frequencies there shown (precise conformity is not necessary).
Or, alternatively, the chief examiner or head teacher may take each of
the distributions of unscaled marks, award a scaled mark between
88 and 92 to the top 0 or ½% of the examinees, marks between 83 and
87 to the next 1% of them, and so on. Only very exceptionally good
or poor candidates, those better or worse than 999 in 1,000 of the
ordinary run, should be given marks of over 90% or under 30%.
For in a normal distribution only a proportion of 0.135% is expected
to score x/σ values of more than +3 or less than −3.
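
The percentages of Ex. 25 can be reproduced approximately as follows. This sketch rests on an assumption of ours: each 5-mark class is taken to extend half a mark beyond its stated limits (so that 58-62 runs from 57.5 to 62.5), which makes the table symmetrical about the mean of 60. The function name is likewise our own.

```python
from statistics import NormalDist

SCALE = NormalDist(mu=60, sigma=10)

def percentage_in_class(low, high, dist=SCALE):
    """Percentage expected between the class limits low - 0.5 and high + 0.5."""
    return 100 * (dist.cdf(high + 0.5) - dist.cdf(low - 0.5))

for low in range(88, 27, -5):
    print(f"{low}-{low + 4}: {percentage_in_class(low, low + 4):.1f}")
```

Rounded to the nearest whole or half per cent, the computed figures should match the table closely.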


We

Ex. 26.—

                  Frequency
Mark             (a)     (b)
16-20     Exceptional cases only
15                 1  }
14                 3  }    1
13                 6       3
12                12       5
11                18       7
10                20       8
 9                18       7
 8                12       5
 7                 6       3
 6                 3  }
 5                 1  }    1
0-4       Exceptional cases only

                 100      40

Burt (1917) recomme 
based on a similar 
range of 0-20. Th 


OBJECTIONS TO RATIONALIZED MARKING 
There is no doubt that 


in one paper, since some questions are ‘done better’ than others.
Teachers similarly regard their classes as ‘better at’ some subjects
than at others, and so deserving of higher marks. Again, the class
they have this year may seem to them an exceptionally ‘good’ or
‘bad’ one, and the compositions written by the class may be ‘better
or poorer than usual.’ Finally, those mathematical lecturers and
teachers who are particularly prone to award very widely dispersed
marks state that their students exhibit a ‘wider range of
accomplishment’ than do students of other subjects.

Justification for Equalizing the Average in All Sets of Marks.— 
Many of these claims are extremely dubious. Thus it is more likely 
that the university departments which award high marks will attract 
a poorer, rather than a better, run of students, and that the subjects 
which are marked more strictly will be pursued chiefly by the better 
students who care more for the subjects than for marks. Again, one 
is justified in suspecting teachers’ judgments about the merits of their 
classes, since some teachers appear, year after year, to be blessed 
with good classes, and others to be cursed with bad ones. According 
to what is popularly known as ‘the law of averages,’ one would 
expect the number of good pupils to be balanced in the long run by 
an equal number of bad ones. As a matter of fact, the statistical 
techniques outlined in Chap. V give us a method of predicting just 
how much classes are likely to vary from year to year, or English 
compositions from week to week, owing to so-called fluctuations of 
sampling. Here is one instance. A class of 45 pupils averages 60% in 
their general scholastic work on a scale of marks running roughly 
from 30% to 90%. From these data it may be stated that there is an 
even chance that the average level of next year’s class will lie between 
59% and 61%, and that it is extremely improbable that next year’s 
will differ from this year’s by more than 3% or 4% up or down. 
Consider also the marking of English compositions, or of half a 
dozen or more separate questions in an examination paper. If the 
scale of marks employed ranges roughly from 4 to 20 (S.D. 3), then 
we can predict that the maximum variation likely to occur in the 
average marks for different compositions or questions is 3 marks, 
provided that there are at least 36 scripts, and only 2 marks if there 
are 81 scripts. An individual pupil may certainly range very widely in 
his level of performance on different questions, but the average level 
of a group of pupils is far more stable than is commonly supposed. 
The apparent variations in the goodness of answers are due much
less to variations in the pupils than to irregularities in the suitability
of the questions or of the topics for compositions. But if an examiner
sets a bad question which is beyond the scope of the majority of
examinees, or which does not allow them to express their fullest
capacities, there is no justification for giving low marks. It is
therefore much fairer to give the same average mark to the actual average
script in all cases, as was suggested in Chap. II.
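
The figures quoted for fluctuations of sampling can be approximated from the standard error of a mean, σ/√N. The sketch below is tentative and makes two assumptions of our own, since the full method belongs to Chap. V: that the 30% to 90% scale has a S.D. of 10, and that the 'maximum variation likely to occur' is a spread of three standard errors either side of the mean.

```python
from math import sqrt

PROBABLE_ERROR_FACTOR = 0.6745   # half of a normal distribution lies within ±0.6745σ

def standard_error_of_mean(sd, n):
    """Standard error of the mean of n measures with standard deviation sd."""
    return sd / sqrt(n)

# Class of 45 pupils averaging 60%; S.D. of 10 assumed for a 30%-90% scale.
probable_error = PROBABLE_ERROR_FACTOR * standard_error_of_mean(10, 45)
print(round(60 - probable_error, 1), round(60 + probable_error, 1))

# Marks on a 4-20 scale with S.D. 3: spread of ±3 standard errors (assumed reading).
def likely_maximum_variation(sd, n):
    return 6 * standard_error_of_mean(sd, n)

for n in (36, 81):
    print(n, round(likely_maximum_variation(3, n), 2))
```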

It is a different matter when a group of pupils or students is known,
from objective evidence, to be selected in such a way as to be, on the 
average, superior or inferior to the mean of some larger group. A 
special class for scholarship candidates and one for backward pupils 
will naturally show generally good and poor attainments respectively. 
But in many cases the teacher or examiner who claims to have a 
specially good or poor set of pupils or scripts is judging purely on 
the basis of subjective recollections of what he regards as average 
level. When a test or examination is applied to all the pupils of a
certain age in a school, or in several schools, then the whole group
should be marked according to the principles already outlined, and
the marks for separate classes within this wider group may be picked
out and averaged. Such averages will then quite justifiably show a
certain range of variation. But when a test or examination is given
only to a single class, it is far safer not to trust the marker's subjective 
view of the class’s general level, but to award an average mark to the 
average script actually obtained from this class. Similarly the aver- 
ages for different school subjects should be the same, regardless of 
whether a class appears to be better at one subject than at another, 
except when objective evidence (derived from the application of the
same tests or examinations to other, larger, groups of pupils) proves
that the class's attainments are uneven.

The 'Goodness' or 'Badness' of a Mark.—We still have to face the 
wider question: what do teachers and examiners mean when they say 
that a pupil's work is good or bad, or an examination paper easy or 
difficult? For endless objections of the type discussed above will be 
raised until this is settled. Most persons who have marking to do 
obviously believe that they possess in their minds some absolute 
standard of goodness and badness by means of which they can 
evaluate a pupil's or a class's achievement. But numerous experiments 
have proved that such subjective judgments of merit or difficulty, 
when obtained from several different persons, are often widely 
discrepant. It was for this reason that we were forced to conclude, in 
Chap. II, that all marking consists primarily in ranking a set of 
performances in order, and that it is a matter of relative comparison, 
not of absolute placement. The goodness of a child's performance at 
some task can therefore mean only that he ranks high when his 
performance is directly compared with the performances of other 
children at the same task, under similar circumstances. An achieve- 
ment is good if only very few people can do as well or better, and it 
is poor if only very few people do it as badly or worse. Similarly, the 
rational definition of the difficulty of a task is the fewness of people 
who can accomplish this task. 'Fewness of people,' in statistical 
terminology, connotes either percentile level or, preferably, x/σ 
value. A mark is not necessarily a good one because it happens to be 
80% or 9/10 and the like. But it is good relative to the marks of some 
particular group of persons if it falls at the 95th percentile, or is 2σ 
above the mean mark obtained by these persons, or if it stands high 
on a standard scale whose mean and S.D. are known. Such 
designations of goodness or badness are entirely unequivocal. 

The Ease or Difficulty of Questions.—By approaching marking 
from this standpoint we obtain a rational scale, not only of the 
goodness or badness of pupils' work, but also of the ease or difficulty 
of questions or test items. If, for example, five items are answered 
correctly by 16%, 22½%, 31%, 40%, and 50% of a large group of 
persons, then we can state that (for these persons) the items are all 
equally spaced in difficulty. For these proportions correspond to x/σ 
values of +1.0, +0.75, +0.50, +0.25, and 0.0 respectively. 
Similarly some questions which are below average difficulty will be 
equally spaced if answered by, say, 91%, 94%, and 96% of pupils, 
for here the proportions of incorrect answers, namely 9%, 6%, and 
4%, correspond to x/σ values of −1.35, −1.55, and −1.75 
respectively. 
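
The conversion from pass rates to x/σ values described above can be checked with a short sketch. It assumes, as the passage does, that the trait is normally distributed; the function name `difficulty_in_sigma` is an illustrative choice, not a term from the text.

```python
from statistics import NormalDist


def difficulty_in_sigma(proportion_correct: float) -> float:
    """Return the x/sigma difficulty of an item passed by this proportion.

    The value is the normal deviate that cuts off the upper
    `proportion_correct` of a standard normal distribution, so a rarely
    passed item receives a positive (hard) value.
    """
    return NormalDist().inv_cdf(1.0 - proportion_correct)


# Proportions answering correctly, taken from the passage above.
harder_items = [0.16, 0.225, 0.31, 0.40, 0.50]
easier_items = [0.91, 0.94, 0.96]

harder_values = [difficulty_in_sigma(p) for p in harder_items]
easier_values = [difficulty_in_sigma(p) for p in easier_items]

for p, z in zip(harder_items + easier_items, harder_values + easier_values):
    print(f"{p:6.1%} correct -> x/sigma = {z:+.2f}")
```

The computed values agree with the quoted ones to within rounding, which is why items with such unequal pass rates can still be equally spaced in difficulty.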
Justification for Equalizing the Dispersion in all sets of Marks.— 
The claim that mathematical accomplishments vary more widely 
than do accomplishments in other subjects requires some further 
consideration. It is true that it is usually easier to discriminate 
between different levels of mathematical than of, say, English 
composition, ability, because tests in the former can generally be marked 
more objectively and accurately. Thus in the work of an ordinary 
school class it may be impossible to distinguish more than about ten 
different levels of essay writing, whereas perhaps a hundred levels of 
mathematical skill may be distinguished. Although there is some 
justification for attaching greater weight to marks obtained by the 
more efficient measuring instrument, yet this greater efficiency is no 
criterion of the range of ability, as the following analogy shows. In 
a school quarter-mile race, one referee may time the boys with a 
seconds' hand watch, and find that some take 50, others 51, 52, ... 
60 seconds. Another referee may possess a tenth-of-a-second 
stop-watch and find that some boys take 50.0, others 50.1, 50.2 ... 59.9, 
60.0 seconds. But the range of ability is the same in both cases. 

It is only fair to mention that some authorities such as G. H. 
Thomson (1939) have suggested that greater variability would be 
found in mathematical than in other abilities, if it were possible to 
measure variability absolutely. But there seems to be no satisfactory 
or convincing way of achieving such absolute measurement in 
psychology. Perhaps it is wisest, therefore, to follow the convention 
that mental testers have adopted of assuming the same dispersion 
for all abilities in representative groups of all ages. 

The Marking of Small Groups.—Peculiar difficulties still remain in 
applying the principles and techniques of this chapter to the marking 
of small groups. The mean and the S.D. of marks among, say, half 
a dozen Honours students may vary far more from year to year than 
they do among 40 or 100 students. If the average is taken as 60% 
this year, the odds are even that it will lie somewhere between 57 and 
63 next year, but it may drop as low as about 50%, or rise as high as 
70%. And if there is only one candidate who is awarded 60% this 
year, a single candidate next year may deserve anywhere from 30% 
to 90%. Until such time as objective tests are available, standardized 
on Honours students from many universities, or over several years, 
the examiner can only employ subjective marking scales, as indicated 
in Chap. II (p. 35). 


TEST NORMS 


Test norms are tables of the scores of large groups of children or 
adults by reference to which it i[...] person, or group of persons, to 
whom [...]al tester fully admits that a test performance, as such, 
tells us nothing as to its goodness or badness, and 
that it must be compared with the performances of other similar 
persons tested under the same conditions. Thus some tests are stan- 
dardized, and norms are obtained, by being applied to large groups 
of typical 11-year junior school children; others are given to school- 
leavers aged 14 to 15, or to grammar school pupils; others to adult 
Army recruits, or to university students, and so forth. American 
testers often publish grade norms, i.e. the standards obtained by 
successive school classes in their elementary and high schools. Here 
too a number of schools representative of all levels must be chosen, 
on account of the considerable variations in standards between 
different schools. 

A mere statement of the average or median scores of such groups 
is, of course, insufficient. Often the deciles are quoted, so that it is 
possible to state that a person’s performance is, say, as good as that 
of the best 20% of the group upon which the test was standardized. 
A system based on x/σ values, or standard scores, is preferable since, 
as shown above, percentile units vary widely in magnitude at different 
parts of the scale. A system which is becoming widely adopted is 
first to determine the percentiles and then to convert these into 
standard scores with an arbitrary S.D. (usually 10, 15 or 20) and an 
arbitrary mean (usually 50 or 100). By this device the converted 
scores may be regarded as an equal-unit scale, even though the 
original distribution of raw scores was markedly non-normal. Thus 
with a S.D. of 15 and a mean of 100, whatever raw score falls at the 
84th percentile is converted to 115. The range of standard scores on 
this scale would be from about 145 to 55. 
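
The percentile-to-standard-score device just described can be sketched as follows. This is a minimal illustration assuming the conversion goes through the normal deviate belonging to each percentile; the function name and default arguments are illustrative, not taken from any published norm table.

```python
from statistics import NormalDist


def standard_score(percentile: float, mean: float = 100.0, sd: float = 15.0) -> float:
    """Convert a percentile rank (0 < percentile < 100) to a standard score.

    The percentile is first turned into a normal deviate, which is then
    rescaled to the arbitrary mean and S.D. chosen for the scale.
    """
    z = NormalDist().inv_cdf(percentile / 100.0)
    return mean + sd * z


for pct in (0.13, 16, 50, 84, 99.87):
    print(f"percentile {pct:>6}: standard score {standard_score(pct):6.1f}")
```

Whatever raw score falls at the 84th percentile comes out at about 115, and the extreme percentiles land near 55 and 145, in line with the range quoted above.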

It should by now be obvious that the usefulness of such norms is 
strictly limited by the nature of the standardization group. Norms 
are purely and simply a summary of the test performances of a 
particular group of testees, and must be interpreted with reference to 
that group. For example, if an American educational test is given 
to a British child and his score falls at the 70th percentile according 
to the norms published with the test, this means that he did better 
than 70% of the American children who took the test, but does not 
necessarily show that he is above the British median or average. In 
point of fact American educational tests are almost useless in this 
country because their standards differ so greatly from ours (cf. 
McGregor, 1934), and, of course, because their content and wording 
are often unsuitable. Probably the same is true of group verbal 
intelligence tests, though non-verbal tests and individual scales such 
as Terman-Merrill and Wechsler-Bellevue (with minor adaptations) 
seem to give comparable results (cf. Chap. IX). Even between 
England and Scotland the differences are so big that few tests which 
have been standardized in one country can be used in the other, 
unless fresh norms have been collected (cf. Vernon, O'Gorman and 
McLellan, 1955). The appalling efficiency of Scottish teaching methods 
raises the performance of Scottish elementary school children on tests 
of the 3 R's by anything from 10% to 50% over the English level. Such 
evidence as we possess, however, points to no appreciable differences 
on tests of general intelligence. 
Many tests have been adequately standardized in London or 
Glasgow, but are not ap[...] areas where the educati[...] the 
adequacy of many of the published test norms are so dubious that 
we would advise testers in schools to treat them all with caution, and 
when possible to do without them. Once a whole [...] children of a 
certa[...] [...]d's standin[...] standards. And if the [...] 
several years, an extremely useful set of 'local' norms could quite 
easily be collected, either in terms of percentiles or of σ scores. 

1 Cf. Burt (1937). The reasons for this effect are probably partly environmental 
and partly hereditary. Po[...]nd upbringing undoubtedly have 
many adverse [...]. [...]t in addition there is indisputable 
evidence of a connection, albeit a small one, between intelligence and social class. 


AGE NORMS 

Age norms may be defined as follows. A child who obtains a 
Mental Age or M.A. (on an intelligence test), or an Educational Age 
or E.A. (on an educational test) of x years is one whose performance 
is as good as that of the average child of Chronological Age or C.A. 
of x years. For example, a child aged 10.0 may do as well in an 
intelligence test as an average 11.0-year-old, but only score up to the 
level of a 9.0-year-old on tests of reading and arithmetic. He is then 
said to possess C.A. 10.0, M.A. 11.0, Reading and Arithmetic Ages 
of 9.0. At first sight this system of expressing test results is far simpler 
than the percentile or x/σ systems. It provides a uniform scale of units 
by means of which a child's relative performances on any number of 
tests can be compared. His strong or weak points in different school 
subjects, or even in particular aspects of school subjects (e.g. the 
main arithmetical skills), and his superiority or inferiority on verbal 
and non-verbal intelligence tests, etc., can be seen at a glance from 
a table or graph of his various E.A.s and M.A.s. Nevertheless, there 
are many serious weaknesses in this type of norm, which we shall 
have to describe in detail. 

Difficulties in Establishing Age Norms.—First there are almost 
insuperable difficulties in collecting standardization groups which 
will be truly representative of all the children of a certain age. Be- 
tween 6 and 11 years the primary schools will provide groups whose 
average scores are probably close to the averages in the total popula- 
tion. Any one age group will, of course, be scattered through a large 
number of different school classes. But the range of ability in primary 
schools is restricted, since some of the brightest and dullest sections 
of the population attend private and special schools respectively. 
And, as we have seen, the range or dispersion of a distribution is 
quite as important as its average. Beyond 11 to 12 years, when 
primary school pupils proceed to various types of schools, it becomes 
increasingly difficult to secure samples whose averages, let alone 
ranges, are likely to be typical. In a very few instances results have 
been obtained from practically all the children in a given district,¹ 
or from all the children born on certain dates.² But often the best 
that can be done is to control the socio-economic factor. Supposing 
that a district contains a hundred schools, and that a tester wishes 
to standardize a test on pupils in ten of them, he should see that the 
schools he selects cover the whole range of socio-economic levels in 
due proportion. An extension of this method was used by S[...] 
(1934) in establishing adult norms for an intelligence test. His testees 
were classified according to their occupations, and the numbers at 
each main occupational level were compared with the corresponding 
numbers in the total population. Since he had an unduly high 
proportion of professionals, and too low a proportion of labourers, he 
reduced the weight of the former and increased that of the latter 
group, before calculating the mean and the S.D. of test scores for his 
combined groups. 

1 Cf. Fraser Roberts et al. (1935). 
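
The reweighting by occupational level can be illustrated with a small sketch. All figures below are invented for illustration (the occupational labels, population shares and scores are assumptions, not the investigator's data), and the weighting rule, each testee weighted by the population share of his group divided by the number of testees in that group, is one simple way of carrying out what the passage describes.

```python
from math import sqrt

# Hypothetical sample: professionals over-represented, labourers under-represented.
sample = {
    "professional": {"population_share": 0.10, "scores": [130, 125, 120, 115]},
    "skilled": {"population_share": 0.50, "scores": [105, 100, 95]},
    "labourer": {"population_share": 0.40, "scores": [90, 85]},
}


def weighted_mean_sd(groups):
    """Mean and S.D. of scores after weighting each group to its population share."""
    pairs = []
    for group in groups.values():
        # Every testee in a group carries an equal part of that group's share.
        weight = group["population_share"] / len(group["scores"])
        pairs.extend((weight, score) for score in group["scores"])
    total = sum(w for w, _ in pairs)
    mean = sum(w * s for w, s in pairs) / total
    variance = sum(w * (s - mean) ** 2 for w, s in pairs) / total
    return mean, sqrt(variance)


all_scores = [s for g in sample.values() for s in g["scores"]]
unweighted_mean = sum(all_scores) / len(all_scores)
weighted_mean, weighted_sd = weighted_mean_sd(sample)

print(f"unweighted mean {unweighted_mean:.1f}")
print(f"weighted mean   {weighted_mean:.1f}, weighted S.D. {weighted_sd:.1f}")
```

With these invented figures the surplus of high-scoring professionals inflates the unweighted mean, and the weighting pulls it back towards what a representative sample would give.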
The same difficulty in obtaining representative samples hinders us 
from tracing accurately the course of intellectual and educational 
growth from childhood, through adolescence, to adulthood. The 
available evidence, however, shows fairly definitely that this growth 
is not entirely regular, which means that M.A. and E.A. units are not 
equivalent throughout. The year by year increase of intelligence in 
an average person seems to be reasonably constant from about 3 to 
10 years, after which the rate of increase diminishes, and the M.A. 
units become progressively smaller, until a constant level is reached 
somewhere in the neighbourhood of 15 years. Probably this upper 
limit is reached at different ages on different tests, and by different 
individuals. When the growth curves of particular children have been 
followed through, they have shown considerable irregularities 
(Dearborn and Rothney, 1941). Average adolescents may show no further 
improvement after about 12 on a simple performance test but, 
provided their education continues, they may go on advancing on 
complex verbal tests even up to 18 or 20. The upper limit for a very 
dull child is, of course, far lower; he may never surpass the 
performance of an average 10-year-old on any test. It is not yet certain 
whether he reaches this limit at the same C.A. as an average child 
reaches his. [...] very bright child, on the other [...] the normal 
15-year level, and [...] up to 22 or even higher. But these units are 
entirely artifici[...] that of an average 
person aged 20, but only that his performance is a good deal better 
than the performance of average persons aged anything from 15 
years upwards. We must conclude, then, that M.A. units are so 
complex, and likely to be so uneven above about 12 years, that it would 
be far better to discard them altogether at such levels. 

2 [...]on (1933). 

Lack of Equivalence of M.A. and E.A. Units.—Still less is known 
about E.A. units. Possibly the increase is fairly steady from about 
6 to 10 or 11 years, but then there is a strong tendency in most 
schools to force children unduly, in order that they may pass the 
secondary school selection examinations. Thereafter some relaxa- 
tion occurs, at least on the formal subjects. Hence the 10- to 11-year 
unit is likely to be exceptionally large and the 12- to 13-year unit 
exceptionally small. The units may then tail off among those in 
modern schools, but among grammar school pupils and university 
students there is likely to be progress well into adulthood. Clearly 
then E.A. and M.A. units do not provide comparable scales much 
above 10 years. A child of 13 with an M.A. and an E.A. of 14 is not 
equally superior in intelligence and in education, since 14 minus 13 
is a bigger quantity in the latter than in the former. Nor, of course, is 
this superiority of 1 year equivalent to the superiority of a child of 7 
whose M.A. and E.A. are 8 years. 

Not long after Binet put forward the M.A. system of scaling in- 
telligence, the discovery was made that a child's relative advance- 
ment or retardation do not remain constant, but increase as his age 
increases. Burt (1917) proved that the same was true of E.A. A child 
of 5 with an M.A. or E.A. of 6 will usually show an M.A. or E.A. of 
12, not of 11, when he reaches C.A. 10. In other words, a superiority 
of a year in a young child is much greater than the same superiority 
in an older child. But the ratio of M.A. to C.A., or E.A. to C.A., 
does seem to remain approximately constant, hence the I.Q. and 
E.Q. have been very widely adopted as measures of intellectual and 
educational advancement and retardation. 
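
The arithmetic behind the quotient can be shown in a few lines. This is a sketch of the ratio definition only (quotient = 100 × test age ÷ chronological age); the function names are illustrative, and the constancy of the quotient over time is the approximation stated in the passage, not a law.

```python
def quotient(test_age: float, chronological_age: float) -> float:
    """I.Q. or E.Q. as 100 times the ratio of M.A. (or E.A.) to C.A."""
    return 100.0 * test_age / chronological_age


def predicted_test_age(q: float, chronological_age: float) -> float:
    """M.A. or E.A. expected at a later C.A. if the quotient stays constant."""
    return q * chronological_age / 100.0


iq_at_5 = quotient(test_age=6, chronological_age=5)
ma_at_10 = predicted_test_age(iq_at_5, chronological_age=10)

print(f"I.Q. at C.A. 5:  {iq_at_5:.0f}")
print(f"M.A. at C.A. 10: {ma_at_10:.0f} (an advancement of 2 years, not 1)")
```

A child one year ahead at 5 has a quotient of 120, and a constant quotient of 120 implies an M.A. of 12 at C.A. 10, which is the figure given above.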

Ambiguity in I.Q. Units.—There is yet another flaw both in M.A. 
and I.Q. and in the corresponding educational units, namely that the 
range and S.D. of scores expressed in these units are wider for some 
tests than for others. For example, the application of educational 
tests to a class of 9-year-olds might show that their average E.A. was 
9, and that 20% of them had E.A.s of 10 or more. But the application 
of a group intelligence test to the same pupils might show an average 
M.A. of 9 and as many as 30% with M.A.s of 10 or more. If this was 
so, then clearly an advancement of 1 year, or an I.Q. or E.Q. of 111 
(i.e. 10/9 × 100), does not mean the same thing on different tests. Some 
intelligence tests yield I.Q.s with a S.D. of 15 or even less, others 
yield I.Q.s with a S.D. as high as 25 or more. Probably educational 
tests show similar irregularities. The S.D. of the old Stanford-Binet 
test I.Q.s was established as about 15, and many modern tests such as 
the Moray House group tests and the Wechsler Intelligence Scales 
for Children and Adults have adopted the same figure as a convenient 
one. Terman-Merrill I.Q.s tend to be more widely dispersed, 
their average S.D. being quoted as 16½. But it varies from 12½ to 20 
at different age levels. This means that I.Q.s (with the same mean of 
100) have different meanings from one test to another, or on the same 
test at different ages. An ordinary school class will commonly obtain 
a range of I.Q.s from about +2σ to −2σ, i.e. from 130 to 70 on a 
test with a S.D. of 15. But the same class tested with a group test 
whose S.D. is 25 will range in I.Q. from 150 to 50. To take another 
example: a class of very bright children may have an average I.Q. of 
115 on a Moray House test. The same class tested with another group 
test may have an average I.Q. of 125. The following table illustrates 
the proportions to be expected in the total population with I.Q.s 125 
or more, or 75 or less, on tests with various S.D.s. 


S.D.      % I.Q. 125 or more, or 75 or less 
10        [illegible] 
15        5 
16½       6½ 
20        [illegible] 
25        16 

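
The proportions in such a table follow directly from the normal curve, and can be recomputed as a check. The sketch below assumes normally distributed I.Q.s with a mean of 100, and assumes that each figure refers to one tail taken separately (the percentage at 125 or more, which by symmetry equals the percentage at 75 or less); that reading is an inference from the legible entries, not something the table states.

```python
from statistics import NormalDist


def percent_at_or_beyond(cutoff: float, sd: float, mean: float = 100.0) -> float:
    """Percentage of a normal population scoring at or above `cutoff`."""
    return 100.0 * (1.0 - NormalDist(mean, sd).cdf(cutoff))


# S.D. values as listed in the table; 16.5 is the assumed reading of the third row.
tails = {sd: percent_at_or_beyond(125, sd) for sd in (10, 15, 16.5, 20, 25)}

for sd, pct in tails.items():
    print(f"S.D. {sd:>4}: {pct:4.1f}% with I.Q. 125 or more (and as many at 75 or less)")
```

The recomputed figures for S.D.s of 15 and 25 round to the legible entries of 5 and 16, which supports the one-tail reading.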
Unfortunately it is impossible to assert that any particular value 
of the S.D. of I.Q.s is the 'right' one, and that the other values are 
too large or too small. The value depends on certain features of the 
test which are still somewhat ob[...]cussion here. The fi[...] 
border-line for ment[...] 
line cases has been carried out with the Stanford-Binet test whose 
S.D. kept fairly close to 15. The only satisfactory solution to this 
problem is to give up the Mental and Educational Age system of 
calculating I.Q.s and E.Q.s, and to establish separate percentile or 
standard-score norms at each age level for which the test is suitable. 
The conventional form of I.Q. has been, and is still being, used far 
too loosely, as though it represented a standard scale of measure- 
ment, independent of the particular test employed. Actually, like 
any other test score, its interpretation is possible only if its S.D. (as 
well as the mean of 100) is taken into account. The derivation of 
standard scores may be somewhat more difficult for untrained 
testers to grasp. But such scores do have a uniform significance with 
different tests at different ages, and—when the S.D. of 15 is adopted 
—they do provide a scale which is very closely comparable to the 
older, generally accepted, scale of I.Q.s. 
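
To see why an I.Q. cannot be read without its S.D., the same standing can be re-expressed on a scale with a different dispersion. The sketch below is an illustration only: the function name is hypothetical, and it assumes both tests have a mean of 100 and differ solely in S.D.

```python
def rescale_iq(iq: float, sd_from: float, sd_to: float = 15.0, mean: float = 100.0) -> float:
    """Re-express an I.Q. from a test with S.D. `sd_from` on a scale with S.D. `sd_to`."""
    sigma_units = (iq - mean) / sd_from  # standing in x/sigma terms
    return mean + sigma_units * sd_to


class_average = rescale_iq(125, sd_from=25)  # bright class on the wide-S.D. test
top_of_class = rescale_iq(150, sd_from=25)
bottom_of_class = rescale_iq(50, sd_from=25)

print(class_average, top_of_class, bottom_of_class)
```

An average of 125 on a test with a S.D. of 25 is the same standing as 115 on a test with a S.D. of 15, and the range 150 to 50 shrinks to 130 to 70, matching the examples given earlier in this section.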


CHAPTER V 


RELIABILITY AND THE THEORY OF SAMPLING 


Untrustworthiness of Results obtained from Small Numbers of Persons.— 
This chapter is likely, we must admit, to cause considerable difficulty to 
readers unfamiliar with statistical concepts. It introduces a point of view 
with regard to educational measurements and experiments which may 
at first seem very complex and involved, yet which is absolutely 
essential to a proper understanding of scientific education and psychology.¹ 
The scientific investigator naturally wishes to arrive at results 
which are true for people in general. He wants to prove that a certain 
method of teaching gives superior results with all children, that the 
correlation² between intelligence tests or school examinations and 
subsequent scholastic achievement always reaches a certain figure, 
or he wants to know the average test score of all persons brought up 
in a particular environment, e.g. of rural as contrasted with urban 
children, or Scottish as contrasted with English, and so on. But he 
can seldom, if ever, [...]o work with samples of the [...] to regard 
his result, whether [...] two averages, or a correlation [...]lt, which 
might alter to some [...] samples of the population. 

[...] in the first place to see that [...] one. Supposing his problem 
[...]telligence test, and he wishes [...]f scores among children aged 
[...]ious chapter, avoid choosing [...] 
comparison of the intelligence of children in suburban and industrial 
areas, he could not measure all the schools of these types. Thus the 
difference which he obtained would only be an approximation to the 
difference that might result from much more complete investigations. 
Here are two concrete instances. 

1 The reader [...]n given by Thouless (1939b). 
2 [...] conception of correlation may be [...] advised to s[...] 
chapter before continuing with this chapter. 

Ex. 27.—Twelve of the writer's training college students were 
given an American objective test of reading ability, namely the 
Nelson-Denny test, which measures knowledge of word meanings 
and comprehension of paragraphs. Their scores, arranged in order, 
were: 140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82. 

Now, most of these scores are well above the average for American 
college students, namely 96. Their average, 113.6, falls at the 78th 
American percentile. But obviously the scores are very variable. 
The question is, then, are we or are we not justified in deducing from 
them that Scottish training college students possess higher reading 
ability than American college students? Obviously no such deduc- 
tion could be made from the score of only a single student, but does 
the average of a sample of 12 provide a much surer basis? The 
12 were picked at random from among the writer’s 264 students. 
Thus he was careful to avoid selecting only those, or a prepon- 
derance of those, who had an Honours degree in English. And men 
and women were equally represented, since it was conceivable that a 
sample composed only of women or only of men would be biased. 
Yet it is quite possible that he unintentionally picked an undue 
number of clever students, and that if he picked another dozen, 
their scores might be mostly below 100, and their average no higher, 
or even lower, than the American figure. How liable then is such 
an average of twelve measures to vary? 
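
As a rough numerical companion to this question, the average of the twelve scores and an estimate of its liability to vary can be computed directly. The sketch uses the ordinary textbook estimate for the standard error of a mean, S.D. ÷ √N, which is an assumption brought in here for illustration and not a formula quoted from this chapter.

```python
from math import sqrt
from statistics import mean, stdev

# The twelve Nelson-Denny scores listed in Ex. 27.
scores = [140, 136, 135, 133, 126, 120, 109, 108, 98, 89, 87, 82]

sample_mean = mean(scores)
sample_sd = stdev(scores)  # S.D. estimated from the sample (N - 1 divisor)
standard_error = sample_sd / sqrt(len(scores))

print(f"mean = {sample_mean:.1f}")
print(f"S.D. = {sample_sd:.1f}")
print(f"estimated standard error of the mean = {standard_error:.1f}")
```

On this estimate, averages of other random dozens would quite commonly differ from this one by several points, so a single sample of twelve gives only a loose fix on the true average.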

The question can be answered in this instance, since the same test 
was given to 21 other sets of 12 students, also chosen at random. 
The 22 averages range from 103.6 to 126.8, their distribution being 
shown in the following table. As we suspected, the averages are 
rather variable, though less so than the separate scores, which range 
all the way from 62 to 160. And though our first average of 113.6 
happens to be fairly close to the grand average of 114.9, yet we might, 
in the luck of the draw, have obtained an average as much as 10 
points above or below this figure. 

Average Mark    Frequency 
124 + 
121 + 
118 + 
115 + 
112 + 
109 + 
106 + 
103 + 
[frequencies illegible] 

Ex. 28.—The correlation coefficient was calculated between the 
marks of 25 students on an examination given near the beginning of 
the year, and their marks on another paper at the end of the year. 
The coefficient was +0.53. But 25 students is only a very small 
sample, and similar results calculated for 13 other similar samples 
varied widely, as shown in the following table. 

Correlation Coefficients    Frequency 
+0.60 to +0.69 
+0.50 to +0.59 
[remaining rows and frequencies illegible] 
Total                       14 

[...]ginal figure is far from trustworthy, and even the 
average coefficient of +0.315, for 350 students, though probably 
much more reliable, might also alter to some extent if the same 
examinations were compared among other similar large groups of 
students. [...] always looks on any [...] deviate more or less fr[...] 
of selection such as [...] normal curves. 
Let us take a fresh instance, where only a single sample result is 
available, in order to illustrate the further stages of the argument. 

[...]ge scores for 136 men and 114 women students 
[...]gence Test III (cf. Ex. 3, p. 10) were 101.55 and 
99.65 respectively. Thus the men were 1.90 points superior. 
Can we deduce that men students, of the type tested, are in general 
better than women, or would we, if we applied the same tests to 
many other similar groups, find the difference to be larger in some 
cases, negligibly small in other cases, or even reversed in others? 
This difference, which we shall call d, is only a sample one, which 
should be regarded as falling at some point on the distribution of a 
large number of possible alternative d’s. Clearly the best possible 
estimate of the true difference between men and women would be a 
grand average of all these hypothetical d's. This quantity—the mean 
of the distribution of d’s—we shall call D. The particular d which we 
know is a more or less accurate approximation to D. 

A formula has been worked out by means of which it is possible 
to predict, from the actual data available, what would be σ_d, that is 
the S.D. of this distribution of d's. The formula will be given later. 
In the present instance it yields σ_d = 1.41. As the distribution is a 
normal one, this figure will enable us to define its limits. We can state 
that about one-third of the d's must fall between D and D + 1.41, 
another third between D and D − 1.41, and that about one-third of 
them must deviate more widely from D. Scarcely any of them, or 
only 0.135%, will be as large as D + 4.23 (i.e. 3σ_d), and only 0.135% 
will be as small as D − 4.23. Hence 99.73% (i.e. 100 − 2 × 0.135) 
of the d's lie within the limits D ± 4.23, and the odds against any 
d lying outside these limits are 99.73 to 0.27, or 370 to 1. As a general 
rule, the probability is 370 to 1 that any d which deviates from D 
purely on account of chance errors of sampling will not do so by 
more than 3σ_d. 
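
These fractions of the normal curve can be verified numerically. The sketch below assumes only that the d's are normally distributed about D; the function name is illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def odds_against_exceeding(multiple: float) -> float:
    """Odds against a chance deviation larger than `multiple` sigmas, either way."""
    p_outside = 2.0 * (1.0 - STANDARD_NORMAL.cdf(multiple))
    return (1.0 - p_outside) / p_outside


# Share of d's lying between D and D + 1 sigma (about one-third).
share_within_one_sigma_above = STANDARD_NORMAL.cdf(1.0) - 0.5

# Share lying beyond D + 3 sigma (about 0.135%).
share_beyond_three_sigma = 1.0 - STANDARD_NORMAL.cdf(3.0)

odds_three_sigma = odds_against_exceeding(3.0)

print(f"between D and D + sigma: {share_within_one_sigma_above:.1%}")
print(f"beyond D + 3 sigma:      {share_beyond_three_sigma:.3%}")
print(f"odds against a deviation beyond 3 sigma: about {odds_three_sigma:.0f} to 1")
```

With σ_d = 1.41 the 3σ limits are D ± 4.23, and the computed odds come out very close to the 370 to 1 quoted above.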

Certain other probabilities have been deduced from the probability 
integral graph (Fig. 18), and are listed in the following table: 

Limits          Frequencies of d's falling    Odds against d's     P value 
                outside these limits          falling outside 
± 0.6745 σ_d    2 × 25.0%                     1 to 1               .50 
± 1.0 σ_d       2 × 15.9%                     Nearly 2 to 1        .30 
± 1.645 σ_d     2 × 5.0%                      10 to 1              .10 
± 1.96 σ_d      2 × 2.5%                      20 to 1              .05 
± 2.58 σ_d      2 × 0.5%                      100 to 1             .01 
± 3.29 σ_d      2 × 0.05%                     1,000 to 1           .001 


Now these predictions must [...] happen to know, whose value is 
[...] its deviating as much as 3.29 σ_d from D, but it might quite 
possibly deviate as much as, say, 1.96 σ_d, since the odds against this 
are much lower. We can therefore, as it were, reverse the argument, 
and state with practical certainty that D must lie between 
1.90 ± 3.29 σ_d, i.e. between +6.54 and −2.74. We do not know, of 
course, what is the value of D, but if it fell outside these limits, our 
d would be more than 3.29 σ_d removed from it, and the odds against 
this are 1,000 to 1. Similarly the odds are 100 to 1 that it lies between 
1.90 ± 2.58 σ_d = +5.54 and −1.74; and 20 to 1 that it lies between 
1.90 ± 1.96 σ_d = +4.66 and −0.86. These latter limits are commonly 
referred to as the .01 and .05 confidence limits, respectively, of our 
obtained d. 
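
The three pairs of limits can be reproduced from d = 1.90 and σ_d = 1.41 alone. The following is a minimal sketch; the function name is illustrative, and the multiples 3.29, 2.58 and 1.96 are those used in the text.

```python
def confidence_limits(d: float, sigma_d: float, multiple: float) -> tuple[float, float]:
    """Lower and upper limits for D: the obtained d plus or minus multiple x sigma_d."""
    half_width = multiple * sigma_d
    return d - half_width, d + half_width


d, sigma_d = 1.90, 1.41

limits = {
    "1,000 to 1": confidence_limits(d, sigma_d, 3.29),
    "100 to 1 (.01)": confidence_limits(d, sigma_d, 2.58),
    "20 to 1 (.05)": confidence_limits(d, sigma_d, 1.96),
}

for label, (low, high) in limits.items():
    print(f"{label:>15}: {low:+.2f} to {high:+.2f}")
```

Every pair of limits includes zero, which already suggests that the observed difference of 1.90 may not reflect any true difference.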

The Standard Error as an Index of the Trustworthiness of a Result.— 
It may be seen then that σ_d provides us with an index of the degree 
of uncertainty in our d. It tells us that the difference of 1.90 in favour 
of men students is by no means trustworthy, since the true value of 
D may be considerably larger or smaller. 

This is the manner in which the scientific investigator always thinks 
of the results of psychological or educational experiments. His 
obtained result, being only a sample based on a limited number of cases, 
is likely to deviate to a certain extent from the true result which he 
would obtain if he could experiment on a very much larger set of 
people. He can, however, work out the σ, often referred to as the 
Standard Error or S.E. of his result, which tells him, not precisely 
how much his result is in error, but what is the probability that his 
result is in error by various amounts. There is a 2 to 1 chance that the 
error may be no bigger than the S.E., and the odds against the 
occurrence of a larger error rise very rapidly, so that it is extremely unlikely 
to be greater than three times the S.E. 

The Probable Error.—The figures at the head of the table, above, 
show that there is an even chance of D lying within, or lying outside, 
the limits ± 0.6745 σ_d. This particular fraction of the S.E. is known 
as the Probable Error or P.E. The name is not a very good one, but 
in a sense it is the most probable error, since the odds against either 
larger or smaller errors than this are higher than 1:1. In our example, 
the P.E. = 0.6745 × 1.41 = 0.95. This means that D is just as likely 
to lie between 1.90 + 0.95 and 1.90 − 0.95 as it is to lie outside these 
limits. 

The P.E. is frequently used as an index of the unreliability or 
margin of uncertainty, in preference to the S.E. Instead of saying 
that the odds against D deviating from d by 3.29 × S.E. are 1,000 
to 1, we can say that a deviation of 5 × P.E. possesses this degree 
of unlikelihood, since 5 × 0.6745 is approximately equal to 3.29. 
Or if ± 1.96 × S.E. are regarded as the limits within which D may 
reasonably be expected to lie, we can alternatively denote these limits 
as ± 3 × P.E. Often a numerical result such as our d of 1.90 is 
stated in the form: 1.90 ± 0.95, where the latter figure is the P.E. of 
the former. While both the S.E. and the P.E. indicate the likely 
amount of error in some numerical result, the S.E. is, as we have 
seen, primarily a measure of the dispersion of all the possible sample 
results. The same is true of the P.E. In the hypothetical distribution 
of sample d's, the limits D + P.E. to D − P.E. enclose 50% of these 
d's, and 25% of them are greater than D + P.E., 25% less than 
D − P.E. Note therefore that the P.E. is identical with Q, the 
semi-interquartile range, of this distribution. For the limits Q₁ to Q₃ also 
mark off the middle 50% of measures in a normal distribution. 
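
The link between the P.E., the S.E. and the quartiles can be confirmed in a few lines. The sketch assumes a normal sampling distribution, as the text does, and takes σ_d = 1.41 from the running example; the variable names are illustrative.

```python
from statistics import NormalDist

# The normal deviate that leaves 25% of cases above it: the upper quartile.
PE_FACTOR = NormalDist().inv_cdf(0.75)

sigma_d = 1.41
probable_error = PE_FACTOR * sigma_d

# Share of sample d's enclosed by D - P.E. and D + P.E.
sampling_distribution = NormalDist(0.0, sigma_d)
share_within_pe = (
    sampling_distribution.cdf(probable_error)
    - sampling_distribution.cdf(-probable_error)
)

print(f"P.E. factor = {PE_FACTOR:.4f}")
print(f"P.E. = {probable_error:.2f}")
print(f"share of d's within D ± P.E. = {share_within_pe:.0%}")
print(f"5 × P.E. factor = {5 * PE_FACTOR:.2f}, 3 × P.E. factor = {3 * PE_FACTOR:.2f}")
```

The last line shows how rough the conversions are: 5 × P.E. is about 3.37 S.E. rather than 3.29, and 3 × P.E. about 2.02 S.E. rather than 1.96, which is close enough for the conventions described above.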

The Statistical Significance of a Difference.—Many experiments 
have shown that the scores of comparable groups of men and women 
on intelligence tests are practically identical, or that any differences 
found are not reliable. Since, in our example, the odds are only 21 
to 1 against D being greater than 4.72 or less than −0.92, it seems 
quite possible that here too the d is not reliable, and that the true 
value of D is zero. If this be so, then it is easy to determine from the 
probability integral graph the chances that a sample might show a d 
of +1.90, merely through errors of sampling. 1.90 is 1.35 × 1.41, 
or 1.35 σ_d. Consulting the graph (Fig. 18), we see that 9% of measures 
in a normal distribution lie at or beyond 1.35σ from the mean. Hence 
when D is zero, 9% of the sample d's amount to +1.90 or more, 
and 41% lie between 0 and +1.90. Thus the odds against a d of 
1.90 being due to errors of sampling are only 41 to 9, or about 4½ to 
1.¹ In other words, it might occur as a matter of chance once in four 
or five testings of similar samples of students. Such odds are far too 
low for us to put any trust in the conclusion that men students are 
better than women at this test. Had d been 3.63 (i.e. 2.58 σ_d) we might 
have felt reasonably certain, since this difference could arise by 
chance only once in a hundred times. It would then be referred to as 
statistically significant at the .01 level. Note that 'significant' does 
not necessarily imply 'meaningful.' We are not concerned here with 
the psychological interpretation of any difference. Statistically 
significant refers purely to the reliability or trustworthiness of the figures. 

1 The acute reader may well ask here: why not contrast the 9% of sample d's 
of +1.90 and over with the 91% of smaller and of negative d's, thus arriving 
[...]is procedure is denoted as the one-tailed test, since [...] 
or larger than a certain amo[...] attempting to prove that thi[...] 
direction, positive or negative. Others, [...] 
test under almost all circumstances, since it provides a more conservative 
estimate of the trustworthiness of their resu[...] 
in effect, implying that the true sex difference [...] 
women, rather than that the true difference might be in favour of men or else 
[...] zero. 

A difference between the results of two groups, divided by its S.E.,
is often called the Critical Ratio or C.R. In this example the C.R. is
only 1.35. It is customary to place little reliance in a difference
whose C.R. is less than 2 or preferably 3. Alternatively a d is
regarded as significantly greater than zero if it exceeds 3, or
preferably 5, times its P.E. But the border-lines are quite arbitrary
matters of convention; we are certainly not justified in rejecting all
smaller differences because they are conventionally non-significant.
Suppose the C.R. was 1.645 (d = 1.19), with a probability of 1 in 10.
Though this might easily arise by chance, it does not prove that the
true value of D is zero. Indeed, according to the .01 confidence
limits, the true value might lie anywhere between + 4.82 and - 2.44.
Such a critical ratio is said to be of borderline significance, since
it suggests that a true positive difference might be slightly more
likely to occur than a zero or a negative one if we were able to
investigate further samples.
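
The step from a critical ratio to a probability, read above from the
probability integral graph, can be sketched in Python. The normal curve
is taken from the standard library; the function names are illustrative
assumptions, not terms used in the text.

```python
from statistics import NormalDist

def critical_ratio(d, se):
    """A difference divided by its standard error."""
    return d / se

def one_tail_probability(cr):
    """Proportion of a normal distribution at or beyond cr S.E. on one side."""
    return 1.0 - NormalDist().cdf(abs(cr))

def two_tail_probability(cr):
    """Proportion at or beyond cr S.E. in either direction."""
    return 2.0 * one_tail_probability(cr)

d, se = 1.90, 1.41                 # figures from the running example
cr = critical_ratio(d, se)
p_beyond = one_tail_probability(cr)   # share of sample d's of +1.90 or more
p_between = 0.5 - p_beyond            # share lying between 0 and +1.90
odds = p_between / p_beyond           # the "41 to 9" comparison

print(round(cr, 2), round(p_beyond, 3), round(odds, 1))
print(round(two_tail_probability(2.58), 3))   # the .01 level
```

Because the graph is read only to the nearest whole per cent, the odds
computed here differ slightly from the rounded 4½ to 1.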

Another note of caution is necessary. Even if d had been large enough
in the above example (relative to its S.E. or P.E.) to be accepted as
significant, the conclusion that men are somewhat superior at the test
would have applied only to the type of students tested, namely Scottish
university graduates who had chosen the teaching career. It could not
be extended to English training-college students, nor to other
university students, far less to men and women in general, since the
persons actually tested could not be regarded as typical or
representative of these wider groups. Again it would be safer not to
claim that such a difference was a difference in intelligence, but only
in the ability which is measured by this particular group intelligence
test. The statistical treatment which has been applied to the
distributions of scores only informs us how sound and reliable are the
results in respect of other similar testees, i.e. it takes care of
sampling errors, but it does not eliminate errors or bias in selection.
As mentioned in Chap. I, statistical methods cannot improve data which
are in any way inaccurate or defective; they merely bring out the
significance (or lack of significance) of the data actually obtained.

FORMULAE FOR THE STANDARD AND PROBABLE ERRORS OF
NUMERICAL RESULTS

Common sense tells us that a numerical result such as an average, or a
correlation, or a difference between two averages, will always be more
reliable the greater the number of cases on which it is based. In
Ex. 27 (p. 75) it was shown that the averages obtained by sets of
twelve students on a test of reading ability were exceedingly variable.
Obviously twelve is far too small a sample upon which to base
conclusions. In order to make a trustworthy comparison between the
performance of Scottish and American students at this test, we should
need much larger numbers. Most of the formulae for reliability involve
the quantity 1/√N, which means that by quadrupling the size of a
sample, we can halve the variability of any result computed from this
sample, or double its reliability. Another important factor which
governs the reliability of a result is the variability among the
original measures. Suppose that, in Ex. 27, all the men students had
scored 101 or 102 (average 101.55), and all the women had scored 99 or
100 (average 99.65), we should certainly have been entitled to claim
that men students do better at the test than women. But it is because
the men's scores range all the way from 73 to 134, and the women from
78 to 129, that we felt suspicious of the difference in averages, and
applied statistical treatment to it. There was so much overlapping that
about 42% of the women obtained higher scores than the average man, and
about 42% of the men scored lower than the average woman. Hence the
reliability formulae also involve some measure of the dispersion or
extent of variation of the original measures.

The S.E. of a Difference. This S.E. is given by the formula:

\[ \sigma_d = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}} \]

Here σ_1 and σ_2 are the S.D.s of the scores in the two groups, N_1 and
N_2 are their numbers. The P.E. of a difference is 0.6745 times this
quantity.

Ex. 30. As a fresh illustration, let us compare the scores on the
Nelson-Denny reading test of students having Honours degrees in Arts
with the scores of students having Honours in Science or ordinary
degrees. The essential data are as follows:

                            N        M         σ
Honours Arts students      66     129.47     16.42
Other students            198     109.97     18.90

The difference between the two averages is 19.50.

\[ \sigma_d = \sqrt{\frac{16.42^2}{66} + \frac{18.90^2}{198}} = \sqrt{4.085 + 1.804} = 2.427 \]

Here the critical ratio is 19.50/2.427, or approximately 8. Such a
difference could not possibly occur through chance errors of sampling,
and we may conclude that Honours Arts students, of the type tested, are
definitely superior to other graduate students at this test.
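
The arithmetic of Ex. 30 can be retraced with a short Python sketch.
The function name `se_difference` is an illustrative assumption; the
figures are those of the example.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """S.E. of a difference between the means of two separate groups."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Honours Arts students versus other students (Ex. 30)
se = se_difference(16.42, 66, 18.90, 198)
pe = 0.6745 * se                      # probable error of the difference
cr = (129.47 - 109.97) / se           # critical ratio

print(round(se, 3), round(pe, 3), round(cr, 2))
```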


S.E. of a Difference between Averages of Scores which are correlated.
When one group of persons takes two tests, or if they repeat a test,
and we wish to examine the statistical significance of the difference
between their two average scores, we must use a modified formula.¹
This applies whenever there exists a correlation, r, between the two
sets of scores.

\[ \sigma_d = \sqrt{\frac{1}{N}\left(\sigma_1^2 + \sigma_2^2 - 2 r \sigma_1 \sigma_2\right)} \]

Ex. 31. Marks in English and arithmetic were awarded to a class of 57
children. These are tabulated in Ex. 23, p. 57. Apparently the
arithmetic marks run higher than the English ones. Is the difference
significant? The figures are as follows:

                Mean      Difference      σ        N
English        67.615                   12.94
Arithmetic     70.510       2.895       14.64      57

The correlation between the two sets of marks is indicated by a
coefficient of + 0.542.

\[ \sigma_d = \sqrt{\tfrac{1}{57}\left(12.94^2 + 14.64^2 - 2 \times 0.542 \times 12.94 \times 14.64\right)} = 1.761 \]

The difference of 2.895 is only 1.64 times its S.E., hence it is not
significant.
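
A brief Python sketch of the correlated case follows, using the figures
of Ex. 31. The function names are illustrative assumptions. The
uncorrelated value is computed alongside to show how much the third
term reduces the S.E.

```python
import math

def se_difference_correlated(sd1, sd2, r, n):
    """S.E. of a difference between two means obtained from the same N persons."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2) / n)

sd_english, sd_arithmetic, r, n = 12.94, 14.64, 0.542, 57

se = se_difference_correlated(sd_english, sd_arithmetic, r, n)
se_if_uncorrelated = se_difference_correlated(sd_english, sd_arithmetic, 0.0, n)
cr = 2.895 / se

print(round(se, 2), round(se_if_uncorrelated, 2), round(cr, 2))
```

Evaluated in full precision the S.E. is about 1.76, agreeing with the
worked figure to two decimal places.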


1 An alternative version of the formula, which obviates calculating the
correlation coefficient, is given below in the section on small samples
(p. 85).

S.E. of a Mean. An average is, as we have seen in Ex. 27, quite as
likely to vary through errors of sampling as a difference between
averages. Hence it is very useful to know how near to the true average
of a very large group is likely to be the average obtained from a group
of limited size. The S.E. of the mean is often symbolized by σ_M, and
is given by the formula:

\[ \sigma_M = \frac{\sigma}{\sqrt{N}} \]

Here σ is, of course, the S.D. of the original scores, whose average
has been computed.

Ex. 32. The twelve scores on the Nelson-Denny test, listed in Ex. 27,
have a mean of 113.58 and S.D. of 20.07.

\[ \sigma_M = \frac{20.07}{\sqrt{12}} = 5.794 \]

The P.E. of M is 0.6745 × S.E. = 3.908.

Thus the true mean for students of the type tested may just as well lie
outside the limits 113.58 ± 3.91 (i.e. 117.49 to 109.67) as within
them. It may be as much as 2.58 σ_M away from 113.58, i.e. as high as
128.53 or as low as 98.63, but it is unlikely to exceed these 0.01
confidence limits. Obviously the figure is extremely unreliable.

By contrast the mean for 264 students was 114.9 and the S.D. 20.15.
Here σ_M = 20.15/√264 = 1.24. The sample is 22 times as large, hence
σ_M is about 1/√22 of its previous value. The P.E. of M = 0.836. From
these figures it follows that there is an even chance of the true mean
lying between 115.7 and 114.1, and that it almost certainly does not
lie outside the limits 118.1 and 111.7.
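
The two cases of Ex. 32 can be set side by side in a small Python
sketch. The function names and the use of 2.58 S.E. for the 0.01 limits
follow the text; everything else is an illustrative assumption.

```python
import math

def se_mean(sd, n):
    """S.E. of a mean: the S.D. of the scores divided by the root of N."""
    return sd / math.sqrt(n)

def limits(mean, se, k):
    """Lower and upper limits lying k S.E. either side of the mean."""
    return (mean - k * se, mean + k * se)

se_12 = se_mean(20.07, 12)       # twelve students
se_264 = se_mean(20.15, 264)     # 264 students

pe_12 = 0.6745 * se_12
limits_12 = limits(113.58, se_12, 2.58)    # 0.01 confidence limits
limits_264 = limits(114.9, se_264, 2.58)

print(round(se_12, 3), round(pe_12, 3), [round(v, 2) for v in limits_12])
print(round(se_264, 2), [round(v, 1) for v in limits_264])
```

Evaluated directly, 2.58 × S.E. for the twelve students is about 14.95,
so the 0.01 limits lie roughly 15 points either side of the mean. With
264 students the corresponding margin shrinks to about 3.2 points.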


S.E. of a Standard Deviation and of a Difference between S.D.s. Not
only the mean but also the S.D. of a set of measures may be affected by
errors of sampling. The formula for the S.E. of a S.D. is:

\[ \sigma_\sigma = \frac{\sigma}{\sqrt{2N}} \]

The S.E. of a difference between two S.D.s is given by:

\[ \sigma_{d_\sigma} = \sqrt{\frac{\sigma_1^2}{2N_1} + \frac{\sigma_2^2}{2N_2}} \]

Ex. 33. The S.D.s of the intelligence test scores of men and women
students were 12.36 and 9.95 respectively, indicating that the
dispersion of ability was greater among the men. The S.E. of the
difference of 2.41 between these two figures is:

\[ \sigma_{d_\sigma} = \sqrt{\frac{12.36^2}{2 \times 136} + \frac{9.95^2}{2 \times 114}} = 0.998 \]

The difference is 2.42 times its S.E., and this shows a fair degree of
statistical significance. The odds against such a difference occurring
through chance errors are 49.23 to 0.77, or 64 to 1. The evidence
therefore allows a reasonably strong presumption that men students of
the type tested are in general more spread out in their ability at an
intelligence test than are women students.
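
The calculation in Ex. 33 can be sketched as follows in Python. The
function names are illustrative assumptions, and the group sizes of 136
men and 114 women are read from the denominators of the worked
calculation.

```python
import math

def se_sd(sd, n):
    """S.E. of a standard deviation."""
    return sd / math.sqrt(2 * n)

def se_sd_difference(sd1, n1, sd2, n2):
    """S.E. of a difference between the S.D.s of two separate groups."""
    return math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))

se = se_sd_difference(12.36, 136, 9.95, 114)
cr = (12.36 - 9.95) / se

print(round(se, 3), round(cr, 2))
```

The second function is simply the root of the sum of the two squared
S.E.s given by the first, which is the general rule for uncorrelated
results.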

S.E. of a Percentage and of a Difference between Percentages. Often it
is not possible to measure the amounts of some quality in a group of
persons, but only to state what percentage of the persons possess or do
not possess it. Suppose that in several schools 60% of the boys and 65%
of the girls regularly take milk, does this indicate a reliable sex
difference? Again the trustworthiness of the result depends on the
number of cases. If p is the percentage, and q = 100 - p, then the S.E.
of p is √(pq/N), and the S.E. of a difference between two percentages
is:

\[ \sigma_d = \sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}} \]

Ex. 34. From how many children would it be necessary to obtain
milk-drinking records in order to be practically certain that the
difference of 65% - 60% is reliable?

This example is posed in the [...] is, the S.E. must not be greater
than 2. Let the number of both boys and girls in the schools where
records are obtained be N. Then:

\[ \sqrt{\frac{60 \times 40}{N} + \frac{65 \times 35}{N}} = 2, \qquad N = 1{,}169 \]

The answer is, then, that the sex difference of 5% would be significant
if based on some 1,200-odd boys and an equal number of girls, but that
it would not be so if the numbers were much smaller.
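
Ex. 34 works the percentage formula backwards, from a permitted S.E. to
the necessary N. A minimal Python sketch, assuming equal numbers of
boys and girls as in the example (the function names are illustrative):

```python
import math

def se_percentage_difference(p1, n1, p2, n2):
    """S.E. of a difference between two percentages (p and q on a 0-100 scale)."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

def required_n(p1, p2, max_se):
    """Cases needed in each group so that the S.E. does not exceed max_se."""
    return (p1 * (100 - p1) + p2 * (100 - p2)) / max_se ** 2

n = required_n(60, 65, 2.0)
n_whole = math.ceil(n)                 # round up to whole children

print(n_whole, round(se_percentage_difference(60, n_whole, 65, n_whole), 3))
```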


One of the commonest faults in writings about educational topics is the
quotati[...]

[The formulae] quoted above for the S.E. of a difference (between
means, between S.D.s, and between percentages) are derived from one and
the same formula:

\[ \sigma_d = \sqrt{\sigma_1^2 + \sigma_2^2 - 2 r \sigma_1 \sigma_2} \]

where σ_1 and σ_2 are the S.E.s of the means, S.D.s, and percentages
respectively (not the S.D.s of the original scores). The third term,
2rσ_1σ_2, applies whenever there is any correlation between the
original measures from which the means, S.D.s, or percentages are
derived. But when, as in most of our examples above, there is no such
correlation, the term vanishes.

Statistical Treatment of Small Samples. It has been shown by Fisher
(1930) and others that numerical results obtained from very small
samples are even less reliable than the statistical treatment described
above would lead us to expect. Under such circumstances it is better to
calculate a S.E. by substituting N - 1 for N in the denominator. The
resulting critical ratios are denoted by the symbol t, and special
tables are available which show the probability of various values of t
for different numbers of cases.¹ For a difference between the means of
two groups, the proper formula is:

\[ t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}} \]

Σx_1² and Σx_2² are the sums of squared deviations from the respective
means. If calculated from raw scores, or from arbitrary means, the
corrections described on pp. 46-7 must be inserted.

For a difference between two sets of scores in the same group, it is
simplest to tabulate the actual differences, d, between each person's
two scores (allowing for signs). Then:

[...]

This automatically allows for the correlation between the two
variables.

Note that this modified treatment, though statistically preferable,
makes very little difference until the N's fall below about 15. The
enormous majority of educational experiments are carried out on much
larger numbers.
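
Both small-sample cases can be sketched in Python. The first function
follows the two-group formula given above. The second assumes the usual
paired form, t = mean of d divided by √(Σ(d - mean d)² / (N(N - 1))),
which is a standard expression and not necessarily the version printed
in the text. The scores are invented purely for illustration.

```python
import math

def t_independent(group1, group2):
    """t and degrees of freedom for the means of two separate groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = sum(group1) / n1, sum(group2) / n2
    ss1 = sum((x - m1) ** 2 for x in group1)   # sum of squared deviations
    ss2 = sum((x - m2) ** 2 for x in group2)
    pooled = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2

def t_paired(first, second):
    """t and degrees of freedom for two sets of scores from the same persons."""
    n = len(first)
    d = [b - a for a, b in zip(first, second)]  # signed differences
    m_d = sum(d) / n
    ss_d = sum((x - m_d) ** 2 for x in d)
    se = math.sqrt(ss_d / (n * (n - 1)))
    return m_d / se, n - 1

# Hypothetical scores, not taken from the text
t_ind, df_ind = t_independent([1, 2, 3], [2, 3, 4])
t_pair, df_pair = t_paired([10, 12, 9, 11], [12, 14, 10, 14])

print(round(t_ind, 3), df_ind)
print(round(t_pair, 3), df_pair)
```

The resulting t is then looked up in tables against the degrees of
freedom returned alongside it.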

1 Cf. Fisher and Yates (1938). Instead of the actual numbers of cases,
the tables refer to what are called degrees of freedom, which are
normally N - 1. Thus for a difference between 13 boys and 10 girls, the
degrees of freedom are 12 + 9 = 21.

Analysis of Variance. An extension of the t-technique enables us to
explore the reliability of differences between several groups
simultaneously, instead of dealing with only two at a time. For
example, we can test the significance of differences in means or
percentages between sets of pupils taught by several alternative
methods, or classes in several schools, or children classified
according to several grades of parental occupation. Again the groups
may be classified in two or more directions simultaneously, say by
teaching method, type of school, sex and intelligence level. The
relative importance of these influences on average test scores, and
their statistical significance, can thus be assessed. Note that this
involves an important departure from the model of the classical
scientific experiment, where the effects of altering one variable at a
time were observed, all other variables being kept constant. Hence
analysis of variance, which was first developed by Fisher for studying
the effects of different fertilizers or other conditions in
agriculture, has become one of the most powerful and most widely used
tools of educational experimentation. It has also proved of particular
value in analysing examination marks: how far these vary with the
examiner, or with the question or topic set, or with the school from
which the candidates come, and so on.

The detailed techniques are somewhat elaborate for description here
(cf. Lindquist, 1940), but the following simple example will give some
indication of their nature.


Ex. 35. The writer's men students were divided, according to their
various courses of teacher training, into seven sections, each
numbering about 22 students. The average level of ability in different
sections seemed to differ considerably, and their average percentage
marks on examinations in Psychology are listed in the second column of
the table below. The third column shows, however, that the marks in any
one section were very variable, usually ranging from 45% to 85%:

Section    Mean Psychology Mark    S.D. of Marks
   A              72.89                 8.25
   B              68.69                 8.43
   C              65.59                 7.88
   D              64.41                 7.05
   E              63.95                 8.58
   F              63.52                10.71
   G              63.48                10.71

Thus in order to answer the question whether there exists any relation
between a student's ability and the section to which he is assigned, we
must determine whether these differences between the means of sections
are significant or whether, in view of the variability of the marks
within each section, such differences might arise through chance
fluctuations of sampling.

'Variance' is simply another term for spread, dispersion, or
variability (actually the variance of a distribution of measures is the
square of its S.D.). Analysis of variance enables us to say how far
these differences between sections can be ascribed to the effect of
individual differences within sections. The result is expressed in
terms of the F-ratio, in much the same way as the significance of
differences between two groups is expressed as t. (Indeed an F-ratio
when only two groups are compared is identical with t².) In this
instance the F-ratio is 2.93, and appropriate tables tell us that the
probability of such differences between seven means arising by chance
is quite small, approximately 0.01. We may conclude that there is a
definite tendency for better students to be put into some sections,
poorer students into others.
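
The nature of the F-ratio can be indicated by a short Python sketch of
the simplest (one-way) case. The individual marks behind Ex. 35 are not
given, so the marks below are hypothetical and the sketch does not
attempt to reproduce the figure of 2.93. The function names are
illustrative. The last lines check the remark that F for two groups is
identical with t².

```python
import math

def f_ratio(groups):
    """One-way analysis of variance: between-groups over within-groups variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    ms_between = ss_between / (k - 1)          # k - 1 degrees of freedom
    ms_within = ss_within / (n_total - k)      # N - k degrees of freedom
    return ms_between / ms_within

def t_two_groups(g1, g2):
    """Small-sample t for two separate groups (pooled variance)."""
    n1, n2 = len(g1), len(g2)
    m1, m2 = sum(g1) / n1, sum(g2) / n2
    ss = sum((x - m1) ** 2 for x in g1) + sum((x - m2) ** 2 for x in g2)
    return (m1 - m2) / math.sqrt(ss / (n1 + n2 - 2) * (1 / n1 + 1 / n2))

# Hypothetical marks for three small sections
sections = [[62, 70, 75, 68], [58, 66, 61, 63], [55, 60, 52, 57]]
f_three = f_ratio(sections)

# With only two groups, F equals t squared
a, b = [1, 2, 3], [2, 3, 4]
f_two = f_ratio([a, b])
t_two = t_two_groups(a, b)

print(round(f_three, 2), round(f_two, 3), round(t_two ** 2, 3))
```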


S.E. and P.E. of a Correlation Coefficient. The correlation between two
sets of measures is likely to be very unreliable if the number of
persons is small, as was shown in Ex. 28 (p. 76). The S.E. or P.E. of a
correlation coefficient tells us how widely this coefficient may be
expected to deviate from the true value, i.e. the value which would be
obtained from an unlimited number of persons. The formulae for these
errors will be given in Chap. VI.

S.E. and P.E. of an Individual Test Score or Examination Mark. If some
pupils or students take a test or examination two or more times, or
answer two or more supposedly parallel tests in the same subject, they
will not obtain precisely identical scores on each occasion. Hence any
score or mark derived from only a single test is to some extent
unreliable. It may deviate more or less from the score which would be
obtained with far more thorough testing. In other words, the single
test only gives us a limited sample of the pupils' or students'
capabilities. Here too, then, the conception of the S.E. has valuable
applications. But the reliability of a score depends, not on the number
of persons who take the test, nor on the wideness of dispersion of
scores, but on the thoroughness of the test itself, or the number of
items of which it is composed, and on the amount of variability in
pupils' answers to these items. Elementary methods of calculation are
given in Chap. VII. A full treatment of the underlying theory may be
found in Gulliksen's (1950) book.

Finally, when a test or examination is employed for predicting a
pupil's success or lack of success during his subsequent educational or
vocational career, it is obvious that the estimate obtained from the
test score will deviate more or less from the truth. The reliability of
such predictions naturally depends mainly upon the adequacy or validity
of the test. It, too, can be expressed in terms of a S.E. or P.E., and
the appropriate methods will be given below.

1 It seems to be the convention in psychological literature to employ
the S.E. in the statistical treatment of averages and the like, and the
P.E. in the treatment of correlations and individual scores. There is
no logical reason for the convention, but we shall usually adhere to it
in this book.


TESTING DIFFERENCES BETWEEN CATEGORIC 
VARIABLES BY CHI-SQUARED 


In educational and psychological research, there are many types of data
which we may wish to analyse or compare, but which are not amenable to
averaging or to the calculation of Standard Errors. For example, the
colour of people's eyes cannot readily be measured as a continuous
variable, yet we can classify individuals into a number of more or less
distinct groups or categories according to their eye colour. Similarly,
we may wish to study the numbers of pupils at different ages who go to
the cinema 0, 1, 2 or more times per week, or the numbers who answer
Yes, Doubtful or No to an item in a questionnaire. Or we may wish to
compare how many boys and girls are given gradings of VG, G, Fair and
Poor in an examination.


Ex. 36. A group of 80 male students stated anonymously that their
voting in a General Election was: 23 Conservative, 13 Liberal and 44
Labour. We shall take it that, in the country as a whole, the
proportions had been 45%, 10% and 45%. Are we justified in concluding
that the students were more left-wing than the average, or might this
difference arise just through chance errors of sampling? Our approach
is to assume that no real differences exist between our group and a
group of 80 typical voters (this is termed the 'null hypothesis'), and
then to analyse the degree of unlikelihood of the difference occurring
by chance.

In the following table, F gives the obtained frequencies of students in
each group or 'cell'. F_e represents the expected frequencies in an
equal number of typical voters (45% of 80 = 36, and so on).

                  F     F_e    F - F_e    (F - F_e)²    (F - F_e)²/F_e
Conservative     23     36      - 13         169            4.694
Liberal          13      8      +  5          25            3.125
Labour           44     36      +  8          64            1.778
                 80     80                             χ² = 9.597

We then calculate Chi-Squared:

\[ \chi^2 = \Sigma \frac{(F - F_e)^2}{F_e} \]

Note that the F - F_e column must add up to zero. The Chi-Squared
Tables,¹ given in most statistical textbooks, tell us that this value,
9.597, has a P or probability of close to 0.01, i.e. that it might
arise by chance only about once in a hundred similar samples. Hence
there is a fairly strong presumption that students of the kind we have
studied are less conservative than the general body of voters. Had our
sample been double the size, chi-squared would likewise be doubled, and
the probability would have dropped to less than 1 in 1,000.
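
The calculation of Ex. 36 can be retraced in Python. The probability is
obtained here from the closed form exp(-χ²/2), which holds only for 2
degrees of freedom; for other cases a table or a statistical library
would be needed. The function name is an illustrative assumption.

```python
import math

def chi_squared(observed, expected):
    """Sum over cells of (F - Fe) squared, divided by Fe."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

observed = [23, 13, 44]                  # Conservative, Liberal, Labour
proportions = [0.45, 0.10, 0.45]         # assumed national proportions
n = sum(observed)
expected = [p * n for p in proportions]  # 36, 8, 36

chi2 = chi_squared(observed, expected)
p_value = math.exp(-chi2 / 2)            # exact for 2 degrees of freedom only

# Doubling every frequency doubles chi-squared
chi2_doubled = chi_squared([2 * f for f in observed], [2 * fe for fe in expected])
p_doubled = math.exp(-chi2_doubled / 2)

print(round(chi2, 3), round(p_value, 4))
print(round(chi2_doubled, 3), p_doubled < 0.001)
```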


The same procedure is used for comparing two or more groups, 
except that we now ask whether our samples show the same distri- 
bution, not whether they conform to some predetermined typical 
distribution. 


1 Such tables show the probability of chi-squared for different
'degrees of freedom'. Degrees of freedom here means the number of cells
whose F's must be known in order to complete the distribution, in this
case 2. For the total number of students is fixed, and we need only to
know the numbers voting for two of the parties in order to determine
the F in the third cell.

Ex. 37. Groups of 75 boys and 65 girls gave the following as their
numbers of visits to the cinema per week. Is there a statistically
reliable sex difference?

No. of             F                    F_e         (F - F_e)²/F_e
Visits    Boys   Girls   Total     Boys    Girls     Boys    Girls
  0         5     10      15       8.03     6.97     1.14    1.32
  1        33     32      65      34.82    30.18     0.10    0.11
  2        26     18      44      23.57    20.43     0.25    0.29
  3         7      4      11 }
  4         3      1       4 }     8.57     7.43     0.69    0.74
  5         1      -       1 }
Total      75     65     140      74.99    65.01     χ² = 4.64
                                                      P = .20

If the null hypothesis were true, we would expect the same proportion
of boys as of girls in each line of the table; i.e. of the 15 children
with 0 visits, 75/140 × 15 = 8.03 should be boys, and 65/140 × 15 =
6.97 should be girls. Each entry in the F_e columns is similarly
calculated. Note that these F_e's must add up to the same totals as the
F's both across and down the way. The last three rows are best combined
together, since there is a general rule that no value of F_e in a
chi-squared analysis should fall below 5, or preferably 10.
(F - F_e)²/F_e is calculated as before, but this time for both columns.

The obtained value of chi-squared is not significant statistically*;
our differences between the sexes could occur by chance 1 in 5 times in
other similar samples.
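
The expected frequencies and chi-squared for a two-way table such as
that of Ex. 37 can be sketched in Python. The last three rows are
combined beforehand, as the example recommends. The function names are
illustrative assumptions.

```python
def expected_frequencies(table):
    """Expected cell frequencies from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_squared_table(table):
    """Chi-squared summed over every cell of a two-way table."""
    expected = expected_frequencies(table)
    return sum((f - fe) ** 2 / fe
               for row, exp_row in zip(table, expected)
               for f, fe in zip(row, exp_row))

# Rows: 0, 1, 2, and 3 or more visits; columns: boys, girls
visits = [[5, 10], [33, 32], [26, 18], [11, 5]]

expected = expected_frequencies(visits)
chi2 = chi_squared_table(visits)
degrees_of_freedom = (len(visits) - 1) * (len(visits[0]) - 1)

print([[round(fe, 2) for fe in row] for row in expected])
print(round(chi2, 2), degrees_of_freedom)
```

Computed without intermediate rounding the total comes to about 4.70,
marginally above the tabulated 4.64; the difference lies in the final
cell and leaves the conclusion unchanged.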


This particular problem might also have been studied by comparing
percentages, using the formula on p. 84. 49.3% of the boys and 35.4% of
the girls go twice or more a week. The C.R. works out at 1.68, which
has a P value of 0.10. Though this gives a slightly more favourable
result, it too provides no acceptable evidence of a reliable
difference; and the more detailed chi-squared analysis is to be
preferred. Another point brought out by this example is that the F's in
each cell should invariably refer to numbers of separate and
independent individuals or items. A very common error among students is
to say that 26 boys have 52 visits between them, and 18 girls 36
visits, and so on, and then to apply chi-squared to the table of
visits.

Another useful application of chi-squared is in testing whether an
obtained distribution conforms to the normal (or any other theoretical)
shape. We have seen that distributions of small numbers of measures may
depart considerably from normality, and that by increasing N, the
irregularities tend to become ironed out. Here also, therefore, the
result obtained from a limited sample (namely an irregular
distribution) can deviate considerably from the result which a much
larger sample would yield (a regular distribution) purely through
errors of sampling: and the unlikelihood of this deviation can be
calculated by chi-squared.

* Here the degrees of freedom are 3. The totals across and down the way
are fixed, thus we need to know only 3 cell entries to complete the
table. In the following example, there are 8 degrees. For though 11
pairs of categories are compared, 3 quantities are fixed: the number of
cases, the mean score, and the standard deviation. For further
explanation of this rather tricky concept of degrees of freedom, see
Fisher (1930).

Ex. 38. The following figures are extracted from Ex. 20 (p. 50). They
show in the F column the actual frequencies of scores obtained by
students on the Cattell test, and in the F_e column the frequencies
which would be expected if the distribution of scores was a truly
normal one. The procedure is as before, the smallest frequencies, at
the head and foot of the columns, being grouped together.

      F             F_e         F - F_e    (F - F_e)²/F_e
   1 }           1 }
   1 }  8        1 }  6           + 2          0.6667
   6 }           4 }
      11              9           + 2          0.4444
      22             19           + 3          0.4737
      26             30           - 4          0.5333
      35             41           - 6          0.8780
      47             44           + 3          0.2045
      43             40           + 3          0.2250
      27             28           - 1          0.0356
      18             18             0          0.0000
      10              9           + 1          0.1111
       3 (grouped)    6 (grouped) - 3          1.5000
     250            250                   χ² = 5.0723

Chi-squared has a probability of 0.75. We might expect as great a
deviation, or greater, in three instances out of four by pure chance.
Thus the approximation to normality is quite satisfactory.


CHAPTER VI 
MEASUREMENT OF CORRELATION 


A very wide range of problems in educational and other branches of
psychology depends on the study of relationships between variables. For
example, examinations are set at the end of the primary school career,
partly in order to test the work that the children have done, but also
partly because achievement in English and arithmetic is supposed to be
related to, or predictive of, future achievement in more advanced work.
The main object of applying intelligence, or other aptitude, tests is
to show probable ability in various fields which are related to the
test performances. In many secondary schools the pupils are allotted to
different classes for general work, for mathematics, and for foreign
languages, since it is believed (with much justification) that
abilities in these three types of study are not very closely related,
that some may be good in general work, poor at mathematics, medium at
languages, and so on.

Several different methods, applicable to different kinds of data, are
available for determining the extent of relationship between two
variables. Almost all of them yield indices ranging from 0 (no
correlation) to + 1.0 (perfect correlation). By far the most frequently
used are the product-moment correlation coefficient and the rank-order
correlation coefficient. We will point out later the limitations of
these two methods, and mention briefly some of the alternatives.

Ex. 39. The following marks were obtained by a class of 36 children on
examinations in history and geography.

   X1         X2          X1    X2       X1    X2
History   Geography
   16         12          10     5        8     6
   19         15          14    18       20    12
   17         11           8     6       17    11
   10         17           6     4       12     8
   18         14          18    20        7     2
   18         16          14     8       17    16
   15          4          12     4       14    11
   17         13          18    14       14     9
   12         12          16    10       12     3
   16          8           6     4       16     1
   11         10          18    14       20    15
   14          9          12    11       17     6

A glance will show that there is some correspondence; that those who
obtain high marks in one tend to get high marks in the other, and that
low marks in one tend to go with low marks in the other; but that there
are many exceptions to this trend. The state of affairs may be seen
more clearly if the two sets of measures are plotted against one
another. Fig. 21 is called a scatter-diagram or scattergram. The marks
have both been grouped into classes of 2. History marks (X1) are shown
on the vertical axis, geography marks (X2) on the horizontal axis. Each
pupil's position in the two exams is shown by a tick. For example,
Alan's 16 and 12 are represented by a tick opposite the 15-16 mark in
history, and below the 11-12 mark in geography. Four pupils obtained
either 17 or 18 in history and either 13 or 14 in geography, hence
there are four ticks in this box or cell.

[Fig. 21. Scattergram, comparing two sets of school marks. Vertical
axis: History (X1); horizontal axis: Geography (X2); both grouped in
classes of 2 from 1-2 to 19-20.]
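
The tallying that lies behind a scattergram can be sketched in Python.
The 36 pairs are transcribed from the table of Ex. 39 (one entry,
history 12 with geography 4, is read from a damaged line and is
therefore an assumption); the helper name `mark_class` is illustrative.

```python
from collections import Counter

# (history, geography) marks, read down the three pairs of columns
pairs = [
    (16, 12), (19, 15), (17, 11), (10, 17), (18, 14), (18, 16),
    (15, 4), (17, 13), (12, 12), (16, 8), (11, 10), (14, 9),
    (10, 5), (14, 18), (8, 6), (6, 4), (18, 20), (14, 8),
    (12, 4),  # assumed reading of a damaged entry
    (18, 14), (16, 10), (6, 4), (18, 14), (12, 11),
    (8, 6), (20, 12), (17, 11), (12, 8), (7, 2), (17, 16),
    (14, 11), (14, 9), (12, 3), (16, 1), (20, 15), (17, 6),
]

def mark_class(mark, width=2):
    """Class interval containing a mark, e.g. 16 falls in (15, 16)."""
    low = ((mark - 1) // width) * width + 1
    return (low, low + width - 1)

# One tick per pupil in the cell for his history and geography classes
cells = Counter((mark_class(h), mark_class(g)) for h, g in pairs)

print(len(pairs))
print(cells[((17, 18), (13, 14))])     # ticks in the 17-18 / 13-14 cell
print(cells[((15, 16), (11, 12))])     # the cell containing Alan's marks
```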


ticks tend to be grouped along 


It will be seen that the majority of 
here are some rather extreme 


the dotted diagonal line, although t 
exceptions; for instance, the pupil with marks of 16 and 1, and the 


pupil with marks of 10 and 17. Obviously the closer the correspon- 
dence between the two sets, the more closely would the ticks fall along 
the diagonal, and the fewer would be the exceptions whose ticks are 
far removed from it. Fig. 26 (p. 133) and Ex. 49 (p. 113) show the 
Scattergrams for а low correlation of + 0: 146 and a high correlation 
of + 0-799 (the numbers in each cell represent the numbers of ticks). 
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Only if each pupil got marks which exactly corresponded would ai 
correlation be perfect and the coefficient be + 1-0. Note, however, ai 3 
it is not necessary for the marks on the two variables to be du. 
for a perfect correlation to be obtained, since these marks may А 
arranged on different scales. In Fig. 21 the geography marks are 4 

a different scale from the history marks; 10 on the one corresponc A 
or is relatively equivalent, to 14 on the other. Consider another s 

stance: the weights and heights of a class of children would certain Д 
yield a very high correlation, although they would be measured x 
entirely different scales—pounds and inches. It is the closeness © 


а d ize of 
agreement of relative scores or measures which governs the siz 
the correlation. 


Very few correlations in educational or psychological investigations
approximate to +1.0. Usually there is only a general trend for
high scores on one variable to go with high scores on the other, a
trend which admits of many exceptions. Two tests of the same ability,
e.g. two similar arithmetic papers, may perhaps give a correlation of
+0.90 or over. Tests or exams in different subjects, e.g. arithmetic
and English, or history and geography, are more likely to give
moderate correlation coefficients of +0.40 to +0.70.

Occasionally one may find negative or inverse correlations,
when high scores on one variable go with low scores on another
variable. For instance, in an ordinary school class intelligence or
scholastic achievement and age are often negatively related, since
the oldest pupils are rather dull, the youngest pupils very bright.
Negative correlations are expressed, just as positive ones, by indices
ranging from 0 to -1.0.


Calculation of Correlations by the Product-moment Method (sometimes
called the Pearson-Bravais Method). The fundamental
formula for the correlation coefficient, r, is:

$$r = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the two distributions.
$x_1$ represents a score or measure expressed as a deviation from the
mean of all the $X_1$ measures, $x_2$ is a measure expressed as a
deviation from the mean of all the $X_2$ measures. $x_1 x_2$ is the
product of a person's two measures, and $\sum x_1 x_2$ is the sum of
all these products.

Since the means are seldom likely to be whole numbers, the computation
of $\sum x_1 x_2$ would involve very troublesome arithmetic.
Hence, just as in calculating S.D.s, it is usual to resort to the device
of the arbitrary mean. Each of the quantities in the above formula
then requires a correction. If $x_1$ is a deviation from an arbitrary
mean, $\sigma_1$ is no longer

$$\sqrt{\frac{\sum x_1^2}{N}} \quad\text{but}\quad \sqrt{\frac{\sum x_1^2}{N} - \left(\frac{\sum x_1}{N}\right)^2}$$

(cf. p. 47). A corresponding alteration is made in $\sigma_2$.

$$\frac{\sum x_1 x_2}{N} \quad\text{becomes}\quad \frac{\sum x_1 x_2}{N} - \left(\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}\right)$$

Hence the full formula is:

$$r = \frac{\dfrac{\sum x_1 x_2}{N} - \left(\dfrac{\sum x_1}{N} \times \dfrac{\sum x_2}{N}\right)}{\sqrt{\dfrac{\sum x_1^2}{N} - \left(\dfrac{\sum x_1}{N}\right)^2}\ \sqrt{\dfrac{\sum x_2^2}{N} - \left(\dfrac{\sum x_2}{N}\right)^2}}$$

This same formula applies if the original scores are used, i.e. if $X_1$
and $X_2$ are substituted for $x_1$ and $x_2$. The only difference is that the
arbitrary mean is now zero, instead of being some whole number
chosen close to the true mean.

Statistical textbooks often provide modifications of this formula,
or 'patent' correlation techniques which are claimed to be simpler.
The present writer prefers the direct method given here, both because
it makes use of concepts such as the true and arbitrary mean, $\sigma$, and
the correction required when $\sigma$ is calculated from an arbitrary mean
(concepts with which the student should already be familiar), and
because errors in computation are fairly readily detected. Moreover,
a coefficient accurate to three places of decimals can be obtained by
this method if the multiplication and division, and the finding of
squares and square roots, are done with a 10-inch slide-rule and/or
four-figure logarithm tables.


[...] than 50, and when a high degree of
accuracy is needed.* In the following table, columns $X_1$ and $X_2$ list
the original marks. As arbitrary means, 14 and 10 are selected.

* When an electric calculating machine is available, this method is
generally used with original scores (not an arbitrary mean), even if N
is very large.
 X1    X2     x1     x2    x1^2   x2^2    x1x2
 16    12     +2     +2      4      4      + 4
 19    15     +5     +5     25     25      +25
 17    11     +3     +1      9      1      + 3
 10    17     -4     +7     16     49      -28
 18    14     +4     +4     16     16      +16
 18    16     +4     +6     16     36      +24
 15     4     +1     -6      1     36      - 6
 17    13     +3     +3      9      9      + 9
 12    12     -2     +2      4      4      - 4
 16     8     +2     -2      4      4      - 4
 11    10     -3      0      9      0        0
 14     9      0     -1      0      1        0
 10     5     -4     -5     16     25      +20
 14    18      0     +8      0     64        0
  8     6     -6     -4     36     16      +24
  6     4     -8     -6     64     36      +48
 18    20     +4    +10     16    100      +40
 14     8      0     -2      0      4        0
 12     4     -2     -6      4     36      +12
 18    14     +4     +4     16     16      +16
 16    10     +2      0      4      0        0
  6     4     -8     -6     64     36      +48
 18    14     +4     +4     16     16      +16
 12    11     -2     +1      4      1      - 2
  8     6     -6     -4     36     16      +24
 20    12     +6     +2     36      4      +12
 17    11     +3     +1      9      1      + 3
 12     8     -2     -2      4      4      + 4
  7     2     -7     -8     49     64      +56
 17    16     +3     +6      9     36      +18
 14    11      0     +1      0      1        0
 14     8      0     -2      0      4        0
 12     3     -2     -7      4     49      +14
 16     1     +2     -9      4     81      -18
 20    15     +6     +5     36     25      +30
 17     6     +3     -4      9     16      -12

 N = 36      +61    +72    549    836     +466
             -56    -74                   - 74
            = +5   = -2                  = +392
Columns $x_1$ and $x_2$ list the deviations of $X_1$ and $X_2$ from these
arbitrary means. They are summed to give $\sum x_1$ and $\sum x_2$.

$$\frac{\sum x_1}{N} = \frac{+5}{36} = +0.1389 \qquad \frac{\sum x_2}{N} = \frac{-2}{36} = -0.0555$$

Therefore the true means are 14.139 and 9.945. Later we shall need

$$\left(\frac{\sum x_1}{N}\right)^2, \quad \left(\frac{\sum x_2}{N}\right)^2 \quad\text{and}\quad \frac{\sum x_1}{N} \times \frac{\sum x_2}{N}.$$

These come to +0.019, +0.003, and -0.008 respectively. Note
that three places of decimals are sufficient for these corrections.
Special attention should be paid to the sign (+ or -) of
$\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$.

The two columns on p. 96, $x_1^2$ and $x_2^2$, give the deviations squared.
These are summed and averaged.

$$\frac{\sum x_1^2}{N} = \frac{549}{36} = 15.250 \qquad \frac{\sum x_2^2}{N} = \frac{836}{36} = 23.222$$

$$\text{Hence } \sigma_1 = \sqrt{\frac{\sum x_1^2}{N} - \left(\frac{\sum x_1}{N}\right)^2} = \sqrt{15.250 - 0.019} = 3.903$$

$$\sigma_2 = \sqrt{23.222 - 0.003} = 4.819$$
The last pair of columns gives the products of the deviations; the
+ and - products are separated merely for convenience in summing.
Special attention to the signs is needed. When $x_1$ and $x_2$ are both +
or both - the product is plus, when one is + the other - the product
is minus. Notice how the exceptional cases, such as the pupil
with 10 and 17, make large negative contributions to $\sum x_1 x_2$, and so
ultimately reduce the size of r. The pupils who get very high marks
on both variables, or very low marks on both, such as those with 6
and 4 or with 18 and 20, make large positive contributions to $\sum x_1 x_2$
and help to raise r.

$$\frac{\sum x_1 x_2}{N} = \frac{392}{36} = 10.889.$$

The correction, $\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}$, is subtracted from this, but as its sign
was -, it is in this case added.

$$\frac{\sum x_1 x_2}{N} - \left(\frac{\sum x_1}{N} \times \frac{\sum x_2}{N}\right) = 10.889 - (-0.008) = 10.897$$

All the corrections indicated in the second formula have been
made, hence we can finally apply the first, or fundamental, correlation
formula, using the corrected values throughout:

$$r = \frac{\sum x_1 x_2}{N \sigma_1 \sigma_2} = \frac{10.897}{4.819 \times 3.903} = +0.579$$
Thus the coefficient shows moderate correspondence between the
two sets of marks, as we guessed from looking at the scattergram.
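The arithmetic of this worked example can be checked mechanically. The following is only a sketch in Python, not part of the original method: the function name product_moment_r and the lists X1 and X2 are our own labels, and the marks are as read from the table of 36 pupils (two illegible rows being inferred from the column totals).

```python
from math import sqrt

# History (X1) and geography (X2) marks of the 36 pupils, as read from the
# table; two illegible rows are inferred from the printed column totals.
X1 = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14, 10, 14, 8, 6, 18, 14,
      12, 18, 16, 6, 18, 12, 8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
X2 = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9, 5, 18, 6, 4, 20, 8,
      4, 14, 10, 4, 14, 11, 6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]


def product_moment_r(xs, ys, a1=0, a2=0):
    """Product-moment r using deviations from arbitrary means a1 and a2."""
    n = len(xs)
    d1 = [x - a1 for x in xs]          # deviations x1
    d2 = [y - a2 for y in ys]          # deviations x2
    c1 = sum(d1) / n                   # correction term, sum(x1)/N
    c2 = sum(d2) / n                   # correction term, sum(x2)/N
    s1 = sqrt(sum(d * d for d in d1) / n - c1 ** 2)
    s2 = sqrt(sum(d * d for d in d2) / n - c2 ** 2)
    cov = sum(a * b for a, b in zip(d1, d2)) / n - c1 * c2
    return cov / (s1 * s2)


r = product_moment_r(X1, X2, a1=14, a2=10)
print(round(r, 3))
```

The choice of arbitrary means does not affect the result; calling the function with the defaults (arbitrary means of zero, i.e. the original scores) gives the same coefficient.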


Product-moment Correlation between Tabulated Measures. When
the measures are grouped into classes, the procedure is almost
identical, except that the computations are, of course, performed on
all the measures in any one class at one time. We are no longer
concerned with the deviations of single measures from their means, and
their products, but with deviations of classes from the mean class,
and their products. At no time is it necessary to translate the results
back into terms of the original measures, by multiplying by c (the
size of the class).


Ex. 41. The two sets of measures are first plotted in a scattergram,
which should contain, if possible, between 11 and 19 rows, and
a similar number of columns. There is no need to have identical
numbers of rows and columns. In Fig. 21 (p. 93) the data are, as a
matter of fact, rather too coarsely grouped, since there are only 8
rows and 10 columns. Over-coarse grouping has the effect of reducing
slightly the size of r (a formula for correcting this may be found in
advanced statistical textbooks). By adding up the frequencies in each
row, and in each column, we get the two frequency distributions of
grouped measures. Fig. 21 is reproduced below with the frequency
distribution of $x_1$ measures ($F_1$) written at the right-hand ends of the
rows, and the frequency distribution of $x_2$ measures ($F_2$) written at
the foot of the columns. The frequencies in both these distributions
should be checked by making sure that they add up to N.

(Empty cells are shown by a dot.)

 x2 =   -4  -3  -2  -1   0  +1  +2  +3  +4  +5 |  F1
 x1
 +3      .   .   .   .   .   1   .   2   .   . |   3
 +2      .   .   1   .   .   2   4   2   .   1 |  10
 +1      1   1   .   1   1   1   .   .   .   . |   5
  0      .   .   .   2   1   1   .   .   1   . |   5
 -1      .   2   .   1   1   2   .   .   .   . |   6
 -2      .   .   1   .   .   .   .   .   1   . |   2
 -3      1   .   2   .   .   .   .   .   .   . |   3
 -4      .   2   .   .   .   .   .   .   .   . |   2
 F2      2   5   4   4   3   7   4   4   2   1 |  36
An appropriate arbitrary mean is now chosen for the $x_1$ classes
and for the $x_2$ classes, and the classes above and below these means
are numbered +1, +2, ... -1, -2, ... etc. These numberings
are written at the left-hand side of the rows, and above the tops of
the columns. Standard deviations are now calculated in the usual
way.


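 x1 and x2 |  F1   F2 |  F1x1   F2x2 |  F1x1^2  F2x2^2
    +5     |        1 |          + 5 |            25
    +4     |        2 |          + 8 |            32
    +3     |   3    4 |  + 9    +12  |    27      36
    +2     |  10    4 |  +20    + 8  |    40      16
    +1     |   5    7 |  + 5    + 7  |     5       7
     0     |   5    3 |  +34    +40  |     0       0
    -1     |   6    4 |  - 6    - 4  |     6       4
    -2     |   2    4 |  - 4    - 8  |     8      16
    -3     |   3    5 |  - 9    -15  |    27      45
    -4     |   2    2 |  - 8    - 8  |    32      32
           |  36   36 |  -27    -35  |   145     213
           |          | = +7   = +5  |

$$\frac{\sum F_1 x_1}{N} = \frac{+7}{36} = +0.1944 \qquad \left(\frac{\sum F_1 x_1}{N}\right)^2 = 0.0378$$

$$\frac{\sum F_2 x_2}{N} = \frac{+5}{36} = +0.1389 \qquad \left(\frac{\sum F_2 x_2}{N}\right)^2 = 0.0193$$

The correction $\frac{\sum F_1 x_1}{N} \times \frac{\sum F_2 x_2}{N}$, which will be needed later in the
numerator of the correlation formula, is:

$$+0.1944 \times +0.1389 = +0.027$$

$$\frac{\sum F_1 x_1^2}{N} = \frac{145}{36} = 4.028 \qquad \sigma_1 = \sqrt{4.028 - 0.038} = 1.997$$

$$\frac{\sum F_2 x_2^2}{N} = \frac{213}{36} = 5.917 \qquad \sigma_2 = \sqrt{5.917 - 0.019} = 2.429$$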
Note that these S.D.s are approximately one-half those obtained
in the previous computation of r (namely 3.903 and 4.819), the
reason being that we have grouped the original measures into classes
of 2. They are not exactly half because the grouping has produced
some slight distortion of the distribution.

Now to obtain the products, the frequency in each cell must be
multiplied by its $x_1$ value and by its $x_2$ value. For instance, the entry
2 in cell ($x_1 = +3$) ($x_2 = +3$) contributes 2 × 3 × 3 = +18 to
$\sum F x_1 x_2$. The entry 1 in cell ($x_1 = -2$) ($x_2 = +4$) contributes
1 × 4 × -2 = -8.
           Cells corresponding to     Frequencies      Total
 x1x2      this value of x1x2         in these cells   F's     F x1x2
           x1       x2
   0       +1        0                     1
            0        0                     1
           -1        0                     1             7        0
            0       -1                     2
            0       +1                     1
            0       +4                     1
  +1       +1       +1                     1             2      + 2
           -1       -1                     1
  +2       +2       +1                     2             2      + 4
  +3       +3       +1                     1             3      + 9
           -1       -3                     2
  +4       +2       +2                     4             5      +20
           -2       -2                     1
  +6       +2       +3                     2             4      +24
           -3       -2                     2
  +9       +3       +3                     2             2      +18
 +10       +2       +5                     1             1      +10
 +12       -3       -4                     1             3      +36
           -4       -3                     2
                                                               +123

  -1       +1       -1                     1             3      - 3
           -1       +1                     2
  -3       +1       -3                     1             1      - 3
  -4       +2       -2                     1             2      - 8
           +1       -4                     1
  -8       -2       +4                     1             1      - 8
                                                               - 22

                                          Sum of F x1x2   =    +101


It is unnecessary, however, to consider each cell in turn. Instead
we take each possible value of $x_1 x_2$ in turn, as shown in the
accompanying table. All the cells for which either $x_1 = 0$ or $x_2 = 0$
contribute zero to $\sum F x_1 x_2$. It will be seen that the total frequencies in such
cells is 7. The total frequency in the cells where $x_1 x_2 = +1$ is 2,
hence +2 is entered in the last column.
As soon as the student becomes familiar with the method he will
find it unnecessary to write out the second, third, and fourth
columns of such a table. Only the first, and the last two are needed.
The simplest plan is to draw a cross through each of the cells for
which $x_1 x_2 = 0$, and to count up the frequencies in these cells while
doing so, and then enter the total in a table thus:

 x1x2     F     F x1x2
   0      7        0
  +1      2       +2

Next cross out the cells where $x_1 x_2 = +1$, and enter in the table,
and so on. At the end, sum the F column, in order to make sure that
none of the entries in the cells have been omitted. The last column
is summed to give $\sum F x_1 x_2$, and the rest of the working is as before.

$$\frac{\sum F x_1 x_2}{N} = \frac{101}{36} = 2.806.$$

The correction, to be subtracted, is +0.027.

$$2.806 - 0.027 = 2.779. \qquad r = \frac{2.779}{2.429 \times 1.997} = +0.573.$$


It will be seen that in spite of the distortion due to grouping the
scores into a rather coarse scattergram, the result is nearly identical
with that obtained from the ungrouped measures, namely +0.579.
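The grouped computation can likewise be sketched in Python. This is an illustration only: the dictionary CELLS and the function grouped_r are our own names, and the cell frequencies are those of the scattergram of Ex. 41 as reconstructed from its row, column and product totals.

```python
from math import sqrt

# (x1 class, x2 class) -> frequency, classes numbered from the arbitrary means.
CELLS = {
    (3, 1): 1, (3, 3): 2,
    (2, -2): 1, (2, 1): 2, (2, 2): 4, (2, 3): 2, (2, 5): 1,
    (1, -4): 1, (1, -3): 1, (1, -1): 1, (1, 0): 1, (1, 1): 1,
    (0, -1): 2, (0, 0): 1, (0, 1): 1, (0, 4): 1,
    (-1, -3): 2, (-1, -1): 1, (-1, 0): 1, (-1, 1): 2,
    (-2, -2): 1, (-2, 4): 1,
    (-3, -4): 1, (-3, -2): 2,
    (-4, -3): 2,
}


def grouped_r(cells):
    """Product-moment r from a table of cell frequencies."""
    n = sum(cells.values())
    c1 = sum(f * a for (a, b), f in cells.items()) / n      # sum(F1 x1)/N
    c2 = sum(f * b for (a, b), f in cells.items()) / n      # sum(F2 x2)/N
    s1 = sqrt(sum(f * a * a for (a, b), f in cells.items()) / n - c1 ** 2)
    s2 = sqrt(sum(f * b * b for (a, b), f in cells.items()) / n - c2 ** 2)
    cov = sum(f * a * b for (a, b), f in cells.items()) / n - c1 * c2
    return cov / (s1 * s2)


print(round(grouped_r(CELLS), 3))
```

Because the class numbers are used directly, no multiplication by the class size is needed, as remarked in the text.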


There are, of course, numerous possibilities of error in working out 
product-moment correlations. All additions should be checked up 
and down; multiplying, dividing, squaring, etc., may be done with 
logarithm tables and confirmed with a slide-rule. But a little practice 
will enable the student to predict r fairly closely from consideration 
of the scattergram, and so to judge whether or not the value he 
reaches is likely to be correct. 

An Alternative Method of Calculating Correlation from the Scattergram.
We will describe one of the various alternative methods,
which may be found simpler to use than the above, direct, method.
It eliminates the collection of the contributions to $\sum F x_1 x_2$ from each
cell, and substitutes the calculation of two additional S.D.s, which
we shall call $\sigma_3$ and $\sigma_4$. Actually, one of these is sufficient, but the
other acts as a valuable check. The formula for r is:

$$r = \frac{\sigma_3^2 - \sigma_1^2 - \sigma_2^2}{2\sigma_1\sigma_2} = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_4^2}{2\sigma_1\sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the $F_1$ and $F_2$ distributions as before.
The other two distributions are obtained by counting the frequencies
along the diagonals.

Ex. 42. Applying this to the scattergram in Ex. 41: for $\sigma_3$ we
start from the top right-hand corner and proceed to the bottom
left-hand, counting all the entries along the diagonal at right angles
to this line as we go. On the diagonal running from ($x_1 = +3$)
($x_2 = +4$) to ($x_1 = +2$) ($x_2 = +5$) there is one entry. The next
diagonal runs from ($x_1 = +3$) ($x_2 = +3$) to ($x_1 = +1$)
($x_2 = +5$); this takes in 2 entries. The next also takes in 2, the
next 1 + 4 + 1 = 6, and so on. The complete distribution, $F_3$, is
given in the following table. The corresponding distribution $F_4$,
along the diagonal from top left to bottom right, is also listed.


Arbitrary means are chosen and the S.D.s worked out in the usual 
way. 


  F3    F4     d  |  F3d    F4d  |  F3d^2   F4d^2
   1          +7  |    7         |    49
   2          +6  |   12         |    72
   2     1    +5  |   10      5  |    50      25
   6     2    +4  |   24      8  |    96      32
   2     0    +3  |    6      0  |    18       0
   2     4    +2  |    4      8  |     8      16
   2     6    +1  |    2      6  |     2       6
   5    10     0  |  +65    +27  |     0       0
   3     8    -1  |    3      8  |     3       8
   2     2    -2  |    4      4  |     8       8
   1     1    -3  |    3      3  |     9       9
   3     1    -4  |   12      4  |    48      16
   2     0    -5  |   10      0  |    50       0
   0     1    -6  |    0      6  |     0      36
   3          -7  |   21         |   147
                  |  -53    -25  |   560     156

$$\frac{\sum F_3 d}{N} = \frac{65 - 53}{36} = 0.333 \qquad \frac{\sum F_3 d^2}{N} = \frac{560}{36} = 15.556$$

$$\sigma_3^2 = 15.556 - (0.333)^2 = 15.445$$

$$\text{Similarly } \sigma_4^2 = 4.333 - (0.056)^2 = 4.330.$$

We already know $\sigma_1^2$ and $\sigma_2^2$ to be 3.990 and 5.898. Hence:

$$r = \frac{15.445 - 3.990 - 5.898}{2 \times 1.997 \times 2.429} = +0.573.$$

The version using $\sigma_4$ instead of $\sigma_3$ gives the same result, which is
also identical with that obtained on p. 99.
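The diagonal method rests on the fact that the variance of the sums $x_1 + x_2$ exceeds, and the variance of the differences $x_1 - x_2$ falls short of, $\sigma_1^2 + \sigma_2^2$ by $2r\sigma_1\sigma_2$. A short sketch in Python illustrates this; the names CELLS and variance_of are ours, and the cell frequencies are those of the scattergram of Ex. 41 as reconstructed from its totals.

```python
from math import sqrt

# (x1 class, x2 class) -> frequency, as in the scattergram of Ex. 41.
CELLS = {
    (3, 1): 1, (3, 3): 2,
    (2, -2): 1, (2, 1): 2, (2, 2): 4, (2, 3): 2, (2, 5): 1,
    (1, -4): 1, (1, -3): 1, (1, -1): 1, (1, 0): 1, (1, 1): 1,
    (0, -1): 2, (0, 0): 1, (0, 1): 1, (0, 4): 1,
    (-1, -3): 2, (-1, -1): 1, (-1, 0): 1, (-1, 1): 2,
    (-2, -2): 1, (-2, 4): 1,
    (-3, -4): 1, (-3, -2): 2,
    (-4, -3): 2,
}


def variance_of(cells, score):
    """Variance of score(x1, x2) over all entries in the scattergram."""
    n = sum(cells.values())
    mean = sum(f * score(a, b) for (a, b), f in cells.items()) / n
    return sum(f * score(a, b) ** 2 for (a, b), f in cells.items()) / n - mean ** 2


v1 = variance_of(CELLS, lambda a, b: a)        # sigma_1 squared
v2 = variance_of(CELLS, lambda a, b: b)        # sigma_2 squared
v3 = variance_of(CELLS, lambda a, b: a + b)    # sigma_3 squared (sums)
v4 = variance_of(CELLS, lambda a, b: a - b)    # sigma_4 squared (differences)

r_from_sums = (v3 - v1 - v2) / (2 * sqrt(v1 * v2))
r_from_differences = (v1 + v2 - v4) / (2 * sqrt(v1 * v2))
print(round(r_from_sums, 3), round(r_from_differences, 3))
```

Counting entries along one set of diagonals is the same as tabulating $x_1 + x_2$; counting along the other set is the same as tabulating $x_1 - x_2$.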
Rank-order Correlation. A different method for determining
correlations is available for data arranged in order, for example the
orders of merit of a school class on two examinations. When the
number of cases is greater than about 30, the figures involved become
troublesomely big, but with a small number the method is much
easier to use than the product-moment method. Often, therefore, sets
of scores are turned into rank orders and the correlation between
these orders is computed. The formula is as follows:

$$r = 1 - \frac{6\sum d^2}{N(N^2 - 1)}$$

where d is the difference between each person's two ranks. Such a
correlation is often indicated by the Greek letter $\rho$ (rho), instead of
by r.


Ex. 43. The second and third columns of the following table
give the marks of eleven pupils on tests $X_1$ and $X_2$. The next two
columns show the ranks, pupil A being 6th on $X_1$, 4th on $X_2$, and
so on. The d column gives the differences in rank, regardless of sign.
These are squared and summed.

 Pupil    X1     X2      Ranks        d      d^2
                       X1     X2
   A      65     75     6      4      2       4
   B     [?]     66     1      7      6      36
   C      64     58     7      9      2       4
   D      78     82     3      1      2       4
   E      60     58     8      9      1       1
   F      70     80     5      2½     2½      6.25
   G      83     80     2      2½     ½       0.25
   H      55     71    10      5      5      25
   I      50     58    11      9      2       4
   J      72     55     4     11      7      49
   K      59     68     9      6      3       9
                                            142.5

(Pupil B's mark on $X_1$ is illegible; it is the highest of the eleven.)

Substituting in the formula:

$$r = 1 - \frac{6 \times 142.5}{11 \times 120} = 1 - 0.648 = +0.352$$

Clearly the bigger the differences in rank positions, the bigger will
$\sum d^2$ be, and the lower the value of r. If $6\sum d^2$ is greater than $N(N^2 - 1)$
the correlation will be negative.
Now there is no difficulty in turning scores into ranks when, as in
$X_1$ in the above example, every person gets a different score. But
when two (or more) persons tie with the same score, they then divide
the two (or more) ranks between them, as in $X_2$. Thus pupils F and
G must share 2nd and 3rd place, and are both called 2½, whilst the
next highest pupil, A, is called 4. Again, C, E, and I share the 8th,
9th, and 10th positions, and are all called 9, whilst J, the next highest,
is 11. Such sharing is undesirable in that it distorts, and generally
decreases, the size of the correlation. It follows, therefore, that the
rank-order method should not be applied when there are a great
many ties.
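The ranking with shared places, and the rank-order formula, may be sketched as follows in Python. The function names are our own; the $X_2$ marks and the $X_1$ ranks are those of the eleven pupils in Ex. 43.

```python
def average_ranks(scores):
    """Rank 1 = highest score; tied scores share the mean of their places."""
    ordered = sorted(scores, reverse=True)
    return [ordered.index(s) + 1 + (ordered.count(s) - 1) / 2 for s in scores]


def rank_order_r(ranks_1, ranks_2):
    """Rank-order coefficient: 1 - 6 * sum(d^2) / (N (N^2 - 1))."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Pupils A to K of Ex. 43.
X1_RANKS = [6, 1, 7, 3, 8, 5, 2, 10, 11, 4, 9]
X2_MARKS = [75, 66, 58, 82, 58, 80, 80, 71, 58, 55, 68]

x2_ranks = average_ranks(X2_MARKS)
print(x2_ranks)
print(round(rank_order_r(X1_RANKS, x2_ranks), 3))
```

Pupils F and G come out at 2.5 and pupils C, E and I at 9, as in the table.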


OTHER METHODS OF MEASURING CORRELATION 


The nature and the main uses of other methods will be indicated
briefly, but they will not be [...] may be found in more advanced [...].

Spearman's Foot-rule [...] range only from +1 [...] often diverge
rather widely from those obtained by the other two methods. A new
and easy method, which possesses certain important statistical
advantages over the rank method, has been described by
Kendall (1948).


Correlation Ratio. This [...] of by r. We saw in Fig. 21 (p. 93)
[...] along a straight diagonal. This is referred to as linear
regression. When the regression is non-linear, that is, when the entries
tend to be grouped round a curve, a product-moment correlation
fails to show the closeness of relationship, and the correlation ratio
is used instead. For example, when tests of motor perseveration are
[...] with low scores are weak in character, whereas those with
medium scores get the best character assessments. This type of
relationship is illustrated in the accompanying scattergram, Fig. 22.
[Fig. 22. Scattergram of non-linear correlation: character ratings
plotted against motor perseveration scores.]

Contingency. Ordinary correlation methods cannot [...] nor to
grossly non-normal [...] the association between, say, [...]
affiliation and socio-[...] chi-squared technique. The individuals
are first classified into convenient numbers of categories for each
variable, and a cross-classification table, similar to a scattergram,
is drawn up. We then assume that
there is no relationship between the variables and compute, on the
basis of this hypothesis, how many individuals would be expected to
fall within each of the cells. If the actual frequencies diverge from
these expected ones to an extent which cannot possibly be ascribed
to chance errors of sampling, this provides evidence of a relationship
between the variables.

Ex. 44. Records of the times spent on television viewing over
one week were kept by 293 grammar school pupils, who were also
classified according to the number of years their families had had
television sets. Does the following table indicate that pupils from
families which have owned sets for a long time view less than
relatively recent owners?

                            Length of Ownership
                   Less than     1 to 2      3 years  |
                    1 year        years      and over |
 Hours of    8+       14            31          41    |   86
 Viewing    4-7       14            49          66    |  129
 in 1 week  0-3        5            21          52    |   78
                      33           101         159    |  293
Note that there is no need to have the same number of categories
for both variables. Quite broad ones are generally adopted, so that
no value of $F_e$ falls much below 10.

Now if there were no relation between the two variables, we
might expect the 86 pupils with 8+ hours viewing to be distributed,
as regards their ownership periods, in the proportions 33 : 101 : 159.
That is, the $F_e$'s for the first line of the table should be:
$86 \times \frac{33}{293}$, $86 \times \frac{101}{293}$, and $86 \times \frac{159}{293}$ = 9.7, 29.6, and 46.7.
The other entries in the $F_e$ table are similarly calculated, and the
$F - F_e$ table does suggest that a certain excess of short-owners view
for more hours than long-owners.

 TABLE OF F_e's                      TABLE OF (F - F_e)
  9.7    29.6    46.7  |   86         +4.3    +1.4    -5.7
 14.5    44.5    70.0  |  129         -0.5    +4.5    -4.0
  8.8    26.9    42.3  |   78         -3.8    -5.9    +9.7
 33     101     159    |  293

 TABLE OF (F - F_e)^2 / F_e
 1.91    0.07    0.70
 0.02    0.46    0.23
 1.64    1.29    2.23

Chi-squared = 8.55, and as there are 4 degrees of freedom (d.f.), the
probability of so large a value arising by chance is about 0.07.
Such odds are not high, but they are suggestive of a slight negative
correlation between amount of viewing and length of ownership.

This chi-squared can be converted into what is called the contingency
coefficient, C. A further modification yields a coefficient
which is more closely comparable to a correlation, $r_c$.

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}} \qquad r_c = \frac{1}{k_1 k_2}\sqrt{\frac{\chi^2 - \text{d.f.}}{N + \chi^2 - \text{d.f.}}}$$

Here, d.f. refers to degrees of freedom, and k provides a correction
for small numbers of categories.

 No. of
 Categories      k
     2         0.798
     3         0.872
     4         0.923
     5         0.949
     6         0.964
     8         0.979
    10         0.986
In our example, the pupils are grouped into 3 categories on each
variable, and $k_1 = k_2 = 0.872$. Thus:

$$C = \sqrt{\frac{8.55}{293 + 8.55}} = 0.168 \qquad r_c = \frac{1}{0.872^2}\sqrt{\frac{8.55 - 4}{301.55 - 4}} = 0.163.$$

Considerable caution is essential in interpreting contingency,
since strictly it is a measure of divergence rather than of relationship.
Suppose, for instance, that our original table of F's had been:

  66    14    49  |  129
  41    14    31  |   86
  52     5    21  |   78
 159    33   101  |  293

C and $r_c$ would be the same as before, although the relationship is
quite different; for now it is the middle-ownership group which
views most, and the short-ownership group which views least. Thus
contingency coefficients never have a negative sign even though, as
in our example, they happen to express a negative correlation.


Two by Two Tables. Ex. 45. Taking the data from Ex. 39,
suppose that we do not know the pupils' actual marks but only the
categories 'top half' and 'bottom half' on the two examinations.*
The following table may be drawn up:

                                 GEOGRAPHY
                        Bottom Half       Top Half
                        10 or under      11 or over
 HISTORY
   Top Half
   15 or over                5               13
   Bottom Half
   14 or under              13                5

There are several ways of determining a correlation for such data.
Two that have often been applied are Yule's coefficient of association
(Q), and his coefficient of colligation ($\omega$).

* There is no necessity for the proportions to be equal, as they are in the
present instance. But coefficients are more reliable the nearer the split is to the
median.

$$Q = \frac{ad - bc}{ad + bc}$$

where a, b, c, and d are the frequencies in the four cells. Substituting,
we find Q = +0.742. Neither of these coefficients is at all closely
comparable to a product-moment correlation, and contingency is
therefore somewhat preferable. For the same table, $\chi^2$ = 7.111;
hence C = 0.406, and $r_c$ = 0.598. The latter figure compares quite
well with the original product-moment r of +0.579.
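The two-by-two working may be sketched in Python. This is an illustration only: the function names are ours, and the marks are those of the 36 pupils as read from the earlier table (two illegible rows being inferred from its totals), split at the cut points stated above.

```python
# History (X1) and geography (X2) marks of the 36 pupils.
X1 = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14, 10, 14, 8, 6, 18, 14,
      12, 18, 16, 6, 18, 12, 8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
X2 = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9, 5, 18, 6, 4, 20, 8,
      4, 14, 10, 4, 14, 11, 6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]


def two_by_two(xs, ys, cut_x, cut_y):
    """Cell counts: a = top/top, b = top/bottom, c = bottom/top, d = bottom/bottom."""
    a = sum(x >= cut_x and y >= cut_y for x, y in zip(xs, ys))
    b = sum(x >= cut_x and y < cut_y for x, y in zip(xs, ys))
    c = sum(x < cut_x and y >= cut_y for x, y in zip(xs, ys))
    d = sum(x < cut_x and y < cut_y for x, y in zip(xs, ys))
    return a, b, c, d


def yule_q(a, b, c, d):
    """Yule's coefficient of association."""
    return (a * d - b * c) / (a * d + b * c)


def chi_squared_2x2(a, b, c, d):
    """Chi-squared for a fourfold table, without continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


a, b, c, d = two_by_two(X1, X2, cut_x=15, cut_y=11)
print(a, b, c, d)
print(round(yule_q(a, b, c, d), 3), round(chi_squared_2x2(a, b, c, d), 3))
```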
A still more suitable technique is known as tetrachoric correlation
($r_t$). This cannot easily be calculated, but it can be read off very
quickly from specially prepared graphs (cf. Chesire et al., 1933).
Note, however, that $r_t$ can be used only when the variables to be
compared (although both expressed in the form of two categories)
can be regarded as normally distributed. It therefore applies to our
examination marks, but could not be used for sex, or eye-colour,
and the like. In the present instance, $r_t$ = +0.642. The discrepancy
between this and our r of +0.579 is probably due to the small
number of cases and the irregularities of their distributions.

Biserial r. When one of the variables to be compared has been
accurately measured and has yielded a normal distribution, but the
other (though believed to be normally distributed) can only be
expressed in the form of two categories, biserial r may be used. For
instance, if we wish to compare intelligence-test scores with success
or failure in a scholarship examination, we may not know the
examination marks, but we do know the intelligence scores of
scholars and non-scholars.


Ex. 46. Assume that the pupils of Ex. 39 are divided into two
groups according to their history marks: the top 17 and the bottom
19. The mean geography marks of these groups are 12.23 and 7.89
respectively.

$$r_{bis} = \frac{M_1 - M_2}{\sigma} \times \frac{pq}{z} = \frac{M_1 - M}{\sigma} \times \frac{p}{z}$$

$M_1$ and $M_2$ are the means of the contrasted groups. In the alternative
formula, M is the mean of the total group. $\sigma$ is the S.D. of the
total group, namely 4.819. p and q are the proportions falling into
the groups, $\frac{17}{36}$ and $\frac{19}{36}$. z is the ordinate of the normal curve
corresponding to p (obtainable from statistical tables); in this instance
z = 0.3977.

$$r_{bis} = \frac{12.23 - 7.89}{4.819} \times \frac{0.2490}{0.3977} = +0.563.$$

Though this result agrees well with our product-moment r, $r_{bis}$
tends to be somewhat unreliable and easily distorted by irregularities
in the distribution.
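The biserial working may be sketched in Python, the ordinate z being computed from the normal curve instead of being read from tables. The function names are our own; statistics.NormalDist is part of the Python standard library (version 3.8 onwards).

```python
from statistics import NormalDist


def normal_ordinate(p):
    """Ordinate of the unit normal curve at the point cutting off proportion p."""
    unit = NormalDist()
    return unit.pdf(unit.inv_cdf(p))


def biserial_r(m1, m2, sd, p):
    """Biserial r from the two group means, the total S.D. and proportion p."""
    q = 1 - p
    return (m1 - m2) / sd * (p * q) / normal_ordinate(p)


def biserial_r_alternative(m1, m_total, sd, p):
    """Alternative form, using the mean of the total group."""
    return (m1 - m_total) / sd * p / normal_ordinate(p)


p = 17 / 36
r_bis = biserial_r(12.23, 7.89, 4.819, p)
m_total = p * 12.23 + (1 - p) * 7.89      # mean of the whole group
print(round(normal_ordinate(p), 4), round(r_bis, 3))
print(round(biserial_r_alternative(12.23, m_total, 4.819, p), 3))
```

The computed ordinate and coefficient agree with the tabled values of the text to within rounding in the third decimal place.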


THE PROBABLE ERRORS OF CORRELATION COEFFICIENTS 

A correlation coefficient is, like any other numerical psychological
result, liable to vary on account of sampling errors, especially when
the sample of persons, whose scores are inter-correlated, is of small
size. The formula for the P.E. of a product-moment coefficient is:

$$P.E._r = \frac{0.6745(1 - r^2)}{\sqrt{N}}$$

Applying this to our r of +0.579, where N = 36, we get P.E. =
± 0.075. It is usual to write the P.E. after the coefficient, thus: r =
+0.579 ± 0.075. The P.E. of a rank order coefficient is slightly
greater than that of a product-moment one, and is usually calculated
by the formula:

$$P.E._\rho = \frac{0.7063(1 - \rho^2)}{\sqrt{N}}$$

For the determination of the P.E.s or S.E.s of the other coefficients
mentioned above, the reader should consult more advanced textbooks,
such as Kelley (1924) or Guilford (1936).
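The two probable-error formulae are easily sketched in Python; the function names are our own.

```python
from math import sqrt


def pe_product_moment(r, n):
    """Probable error of a product-moment coefficient."""
    return 0.6745 * (1 - r * r) / sqrt(n)


def pe_rank_order(rho, n):
    """Probable error of a rank-order coefficient."""
    return 0.7063 * (1 - rho * rho) / sqrt(n)


print(round(pe_product_moment(0.579, 36), 3))   # history and geography marks
print(round(pe_rank_order(0.352, 11), 3))       # the eleven pupils of Ex. 43
```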

The connotation of the P.E. of r or $\rho$ is similar to that of an
average or difference (cf. Chap. V). Thus +0.579 ± 0.075 signifies
that if the same history and geography papers were set to much
larger numbers of similar children, and the marks inter-correlated,
the true value of r so obtained would be as likely as not to lie
between +0.654 and +0.504 (i.e. r + 1 × P.E. and r - 1 × P.E.),
but that there is a 1 in 20 probability that it might deviate as widely
as +0.804 or +0.354 or more (3 × P.E.).

Just as a difference between averages should be two or more times
its S.E. to be accepted as statistically significant, so a correlation
should be 3 times its P.E. before it is regarded as showing a real
relationship between the inter-correlated variables. A coefficient of,
say, +0.30 ± 0.10 is on the borderline of significance. For if the
true value of r was zero, the scores of a limited sample of persons
would yield +0.30 once in 44 times.

Our previous illustration +0.579 is much greater than 3 × P.E.
Yet, as we have seen, this value is far from trustworthy, the reason
being the small size of N. Most investigators consider a correlation
calculated from less than 25 cases to be almost worthless because
of its low reliability, and they prefer to have at least a hundred cases
before they base any important conclusions on the results of an
experiment involving correlations.

When numbers are small, it is better to calculate t instead of the
P.E.

$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}$$

For r = 0.30 and N = 36, t = 1.834. Thus this coefficient barely
reaches the 0.05 confidence level.
Fisher's z Method. The P.E. method of indicating the reliability
of r is inaccurate, not only when N is small, but also when r is large,
exceeding say 0.40. Fisher (1930) presents a better method which is
generally adopted nowadays. It consists in transforming r into a new
quantity called z, whose true S.E. is easily calculated.* Applying
this method to our r = +0.579, we find that the limits within which the
true r is as likely as not to lie are almost identical with the limits
derived, above, from $P.E._r$, but that the 0.05 limits should be +0.762
and +0.310 instead of +0.804 and +0.354.
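The 0.05 limits quoted here can be sketched in Python, assuming the usual form of the transformation, $z = \frac{1}{2}[\log_e(1 + r) - \log_e(1 - r)]$ with S.E. $1/\sqrt{N - 3}$, and taking 1.96 S.E. on either side of z. The function name is our own.

```python
from math import atanh, sqrt, tanh


def fisher_limits(r, n, criterion=1.96):
    """Limits for the true r: transform to z, step out, transform back."""
    z = atanh(r)                 # same as 0.5 * (log(1 + r) - log(1 - r))
    se = 1 / sqrt(n - 3)
    return tanh(z - criterion * se), tanh(z + criterion * se)


lower, upper = fisher_limits(0.579, 36)
print(round(lower, 3), round(upper, 3))
```

The limits are not symmetrical about r, which is the chief point of difference from the P.E. method.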
Combining and Comparing Correlations. The z method also provides
a useful way of combining two or more correlations.

Ex. 47. Suppose that another class of 43 pupils had taken the same
history and geography papers and yielded a coefficient of
+0.450 ± 0.082, what is the best estimate of the true correlation? The
procedure is to determine $z_1$ and $z_2$ for the two r's, and then to compute
the weighted average,

$$\frac{z_1(N_1 - 3) + z_2(N_2 - 3)}{N_1 + N_2 - 6}$$

Translating back from the average z gives us our answer,
+0.512 ± 0.057.
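The weighted average of Ex. 47 may be sketched in Python, again assuming the usual z transformation; the function name is our own. The sketch works to more decimal places than printed tables of z, so its answer may differ from the text in the third place.

```python
from math import atanh, tanh


def combine_correlations(results):
    """Combine (r, N) pairs by averaging z, each weighted by N - 3."""
    weighted_sum = sum(atanh(r) * (n - 3) for r, n in results)
    total_weight = sum(n - 3 for r, n in results)
    return tanh(weighted_sum / total_weight)


combined = combine_correlations([(0.579, 36), (0.450, 43)])
print(round(combined, 3))
```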


It is often required to find whether a correlation in one group is
appreciably different from that in another group, or whether the
difference can be ascribed to errors of sampling. When the numbers
of cases are large the stock method may be used,

$$\sigma_d = \sqrt{S.E._{r_1}^2 + S.E._{r_2}^2}.$$

* $z = \frac{1}{2}[\log_e(1 + r) - \log_e(1 - r)]$. The S.E. of $z = \frac{1}{\sqrt{N - 3}}$. Fisher
provides tables of z for various values of r.


Ex. 48. What is the significance of the difference between
+0.579 ± 0.075 and +0.450 ± 0.082? The S.E.s of these coefficients
are the P.E.s ÷ 0.6745, i.e. 0.1108 and 0.1216. $\sigma_d$ = 0.1645. The
difference is 0.579 - 0.450 = 0.129, which is not as large as its own
$\sigma_d$. A difference as large as or larger than this would occur 43% of
times as a mere matter of chance.

As the numbers are small, Fisher's z method should preferably be
used. This consists in computing $z_1$, $z_2$, and their S.E.s, and applying
the stock formula to these S.E.s. The difference in z's is +0.1763
in the above example. $S.E._{z_1}$ = 0.1741, $S.E._{z_2}$ = 0.1581. $\sigma_d$ =
$\sqrt{0.1741^2 + 0.1581^2}$ = 0.2352. Here the difference is 0.75 times its
$\sigma_d$, and it might occur 45% of times as a mere matter of chance.
Note that these methods do not apply to the difference between two
correlations in the same group of people (cf. Lindquist, 1940).
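Both comparisons of Ex. 48 may be sketched in Python. The function names are our own, the z transformation is assumed to take its usual form, and the chance probability is taken from both tails of the normal curve (statistics.NormalDist, Python 3.8 onwards).

```python
from math import atanh, sqrt
from statistics import NormalDist


def chance_probability(difference, sd_of_difference):
    """Two-tailed probability of a difference at least this large."""
    ratio = abs(difference) / sd_of_difference
    return 2 * (1 - NormalDist().cdf(ratio))


def stock_method(r1, n1, r2, n2):
    """S.E. of each r taken as (1 - r^2)/sqrt(N), i.e. P.E. / 0.6745."""
    se1 = (1 - r1 * r1) / sqrt(n1)
    se2 = (1 - r2 * r2) / sqrt(n2)
    return chance_probability(r1 - r2, sqrt(se1 ** 2 + se2 ** 2))


def z_method(r1, n1, r2, n2):
    """The same comparison made on Fisher's z, with S.E. 1/sqrt(N - 3)."""
    sd = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return chance_probability(atanh(r1) - atanh(r2), sd)


print(round(stock_method(0.579, 36, 0.450, 43), 2))
print(round(z_method(0.579, 36, 0.450, 43), 2))
```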


CHAPTER VII 


INTERPRETATION AND APPLICATIONS OF 
CORRELATIONS 


A CORRELATION coefficient is never an end in itself, but always a
means towards the proper interpretation of numerical data. By
studying the correlations between tests of various abilities it is
possible to analyse scientifically the nature of such abilities. This is
one important application, which we shall consider in a later chapter.
But the primary value of a correlation is that it enables us to make
predictions. If abilities $X_1$ and $X_2$ are known to be inter-correlated,
then we can use measures of $X_1$ to predict scores on $X_2$ and vice
versa. For instance a pupil's probable scholastic success may be
forecast from his performance on an intelligence test. This cannot,
of course, be done with complete accuracy, but the correlation
coefficient will also tell us how accurate is the forecast, or what are its
limits of error.

Ex. 49. The following scattergram shows the Intelligence Quotients
of 440 children and adults on the Stanford-Binet scale, and
their I.Q.s on the Vocabulary test from this scale. (The group or
category 140 includes all testees with I.Q.s around 140, i.e. ranging
from 135½ to 145½; similarly with the other groups.) The
correlation is quite a high one, namely +0.799 ± 0.012. Hence we
would be justified in testing people only with the Vocabulary test,
and predicting from their results what would be their Binet I.Q.s.
Let us see how to make such predictions, and how good they
would be.

Suppose a child to obtain a Vocabulary I.Q. of 120, his Binet I.Q.
may, according to the scattergram, lie as high as the 140 class, i.e.
up to 145, or as low as the 100 class, i.e. down to 96. But presumably
its most probable value will be the average of the column:

 X1      F
 140     3
 130    12
 120    16
 110     9
 100     8

(The entry for the 100 class is illegible; 8 is the frequency implied
by the column average of 118.5.)
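[Scattergram of $X_1$ = Binet I.Q. (rows) against $X_2$ = Vocabulary I.Q.
(columns), N = 440. The cell frequencies are not recoverable; the
marginal figures which survive are given below.]

 Binet I.Q.     Number of      Average
 class (X1)     testees        X2
    140            12          130.1
    130            27          121.5
    120            48          115.6
    110            71          107.6
    100           100           99.3
     90            96           91.2
     80            54           82.4
     70            25          [illegible]
     60        [illegible]      65.7
                  440

 Vocabulary I.Q.
 class (X2)      60    70    80    90   100   110   120   130   140   150
 Average X1    69.2  75.0  88.0  91.9  99.9 109.9 118.5 121.6 128.3 138.0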


This works out at 118-5. Similarly for a child whose Vocabulary I.Q. 
is 70, the most likely value of his Binet I.Q. is 75-0, though it may 
lie anywhere between 56 and 95. 

Along the bottom of the scattergram are given the averages of the 
columns, i.e. of the X, measures which most probably correspond 
to the X, measures at the heads of the columns. Fig. 23 is a graph 
of these figures. It will be seen that the points on the graph lie 
approximately on a straight line which runs through X, = 100, 
X, — 100. That they do not all lie exactly on the line is due merely 


to irregularities in the data. For though the numbers involved in our 


example are large, they are still not large enough to yield really 
smooth normal distributions, or regularly spaced averages. 

Now as we approach the extreme values of Ху, the corresponding 
values of X, tend to lag behind; 118:5 is smaller than 120, 138 is 
much smaller than 150. Similarly with values below 100, 75 is not so 
low as 70, 69 is much less low than 60. This is an important and 
characteristic feature of all correlations. It is known as the regression 
of X, towards the mean. The line drawn through the points on the 
graph is called a regression line, or а graph showing the regression 
of X, on X,. 
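The lag towards the mean can be checked directly from the column averages at the foot of the scattergram. The Python sketch below is only an illustration: the names `column_means` and `lag_ratio` are hypothetical, and the common mean of 100 is an assumption taken from the statement that the line runs through $X_1 = 100$, $X_2 = 100$.

```python
# Column averages copied from the foot of the scattergram:
# Vocabulary I.Q. class (X2) -> average Binet I.Q. (X1) of that column.
column_means = {
    60: 69.2, 70: 75.0, 80: 88.0, 90: 91.9, 100: 99.9,
    110: 109.9, 120: 118.5, 130: 121.6, 140: 128.3, 150: 138.0,
}
MEAN_IQ = 100.0  # assumed common mean of both tests


def lag_ratio(x2, mean_x1, centre=MEAN_IQ):
    """Deviation of the column average from the mean, expressed as a
    fraction of the deviation of the column heading from the mean."""
    return (mean_x1 - centre) / (x2 - centre)


# The 100 column is skipped because its heading does not deviate from the mean.
ratios = {x2: lag_ratio(x2, m) for x2, m in column_means.items() if x2 != 100}

for x2, ratio in sorted(ratios.items()):
    print(f"X2 = {x2:3d}  average X1 = {column_means[x2]:5.1f}  ratio = {ratio:.2f}")
```

A ratio below 1 in every column means that each average Binet I.Q. lies nearer to 100 than the Vocabulary I.Q. from which it is predicted, which is what regression towards the mean asserts.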

Suppose now that the correlation is a perfect one. The scores on
$X_1$ and $X_2$ would be identical, and the slope of the regression
line would be 45°. Next suppose that there is no correlation at all
between $X_1$ and $X_2$. It would then obviously be impossible to
predict scores on one variable from those on the other. If we took the
different columns of the scattergram and averaged them, we should find
that all these means lay close to the mean of $X_1$. Whatever the value
of the Vocabulary I.Q., the average Binet I.Q. would be about 100. In
other words, the regression line would be horizontal. From these two
hypothetical cases we can now see that the nearer the slope of the
regression line is to 45°, the higher the correlation; the nearer it is
to horizontal, the lower the correlation. Furthermore, when the
correlation is perfect, there is no regression towards the mean, since
each $X_1$ value is as big as each $X_2$ value. But when the correlation
is zero, the regression is complete, since whatever the value of $X_2$,
the value of $X_1$ is always the mean. Thus the greater the regression,
the more nearly the correlation approaches zero.

FIG. 23.—Graph showing the most probable values of $X_1$ for testees
who obtain each value of $X_2$.

We have assumed in the previous paragraph that the $X_1$ and $X_2$
measures are expressed in equivalent units. If the dispersions of the
two distributions are very different, then the slope of the regression
line corresponding to perfect correlation would not, of course, be
45°. However, as shown in Chap. IV, measures can be turned into
equivalent units if they are expressed as deviations from their means,
and divided by their standard deviations, i.e. as $x_1/\sigma_1$ and
$x_2/\sigma_2$. If this is done the slope of the regression line is
identical with r. When the slope is 45°,
$\frac{x_1/\sigma_1}{x_2/\sigma_2} = 1.0$, that is perfect correlation.
When the slope is horizontal, $\frac{x_1/\sigma_1}{x_2/\sigma_2} = 0.0$,
that is zero correlation.

The slope of the regression line plotted in Fig. 23 is 0.80. $\sigma_1$
and $\sigma_2$ are 17.49 and 17.84 (i.e. the Vocabulary I.Q.s are
slightly more spread out than the Binet I.Q.s). Hence
$r = 0.80 \times \sigma_2/\sigma_1 = 0.815$.

This is very close to the figure already quoted for the correlation, as
determined by the product-moment method. As a matter of historical
fact, r was originally determined by plotting the regression line
and measuring its slope, before statisticians invented the
product-moment method. The latter method is always employed now because,
as we have seen, the regression line obtained by averaging the
columns is somewhat inexact, even when N is large. It is more useful
to reverse the historical procedure, to calculate $\sigma_1$,
$\sigma_2$, and r, and from them to determine the regression line. If
$\frac{x_1/\sigma_1}{x_2/\sigma_2} = r$, then

$$x_1 = r\,\frac{\sigma_1}{\sigma_2}\,x_2.$$

This is the algebraic equation of the regression line, and it can be
used for predicting any individual's most probable $x_1$ score from
his $x_2$ score. Alternatively, if we wish to predict in terms of
original measures, instead of measures expressed as deviations from
their means, the formula becomes:

$$(X_1 - M_1) = r\,\frac{\sigma_1}{\sigma_2}\,(X_2 - M_2)$$

Ex. 50.—What is the most probable Binet I.Q. corresponding to
a Vocabulary I.Q. of 140? According to the averages listed at the
foot of the scattergram, the answer is 128.3. But this is based on only
6 cases and is therefore very unreliable. Applying the formula:

$$X_1 - 100 = 0.799\,(140 - 100) \times \frac{17.49}{17.84}$$

$X_1 = 131.3$ is the correct answer.
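The arithmetic of Ex. 50 may be sketched in a few lines of Python. The function name `predict` and its parameter names are hypothetical; the figures are those quoted in the text, with both means taken as 100 as in the worked example.

```python
def predict(x_known, mean_known, sd_known, mean_target, sd_target, r):
    """Most probable target score from the regression equation
    (X_t - M_t) = r * (sd_t / sd_k) * (X_k - M_k)."""
    return mean_target + r * (sd_target / sd_known) * (x_known - mean_known)


# Ex. 50: Binet I.Q. (S.D. 17.49) predicted from a Vocabulary I.Q.
# (S.D. 17.84) of 140, with r = 0.799.
binet = predict(x_known=140, mean_known=100, sd_known=17.84,
                mean_target=100, sd_target=17.49, r=0.799)
print(round(binet, 1))  # 131.3
```

A score exactly at the mean is always predicted to stay at the mean, whatever the value of r, since the deviation term vanishes.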
Regression of $X_2$ on $X_1$.—So far we have dealt only with the
prediction of $X_1$, knowing $X_2$. The opposite is equally possible.
On the right-hand side of the scattergram in Ex. 49 are listed the
average values of $X_2$ which constitute the best estimates of the
Vocabulary I.Q.s of persons with various Binet I.Q.s. Here, too, there
is regression towards the mean. The most probable Vocabulary I.Q.
corresponding to a Binet I.Q. of 120 is 115.6, or for a Binet I.Q. of
70 is 74.0. A second graph may therefore be drawn showing the
regression of $X_2$ on $X_1$. Fig. 24 shows the two lines
diagrammatically. If the correlation was perfect the two lines would
coincide, both having a slope of 45° (provided always, that $X_1$ and
$X_2$ are measured in equivalent units). If the correlation was zero,
the second line would be vertical, as in Fig. 24, since the only
possible estimate of Vocabulary I.Q. from Binet I.Q. would be 100.
With an intermediate correlation, such as our +0.799, the second line
makes the same slope with the vertical as does the first regression
line with the horizontal; and the two cross at the means of $X_1$ and
$X_2$. Thus the algebraic equation of the second line, which is also
the formula for predicting $X_2$ from $X_1$, will be:
$x_2 = r\,\frac{\sigma_2}{\sigma_1}\,x_1$, or, in terms of scores,

$$(X_2 - M_2) = r\,\frac{\sigma_2}{\sigma_1}\,(X_1 - M_1)$$

FIG. 24.—Regression lines when r = +0.799.

Ex. 51.—What is the probable Vocabulary I.Q. of a testee with
Binet I.Q. 80?

$$X_2 = 100 + 0.799\,(80 - 100) \times \frac{17.84}{17.49} = 83.7$$

The estimate based on the scattergram was 82.4, which agrees
quite closely with our answer.


P.E. of Estimated Scores.—The regression equations will tell us
the most probable estimate of $X_1$ from $X_2$, or of $X_2$ from
$X_1$. But it is obvious from the scattergram that the actual $X_1$
may fall within quite a wide range on either side of the estimated
$X_1$, and that similarly an estimated $X_2$ is only the average of a
wide range of possible $X_2$'s. The ranges may be determined from the
formulae:

$$\text{P.E.}_{\text{estim. } X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2}$$

$$\text{P.E.}_{\text{estim. } X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2}$$

Applied to Exs. 50 and 51:
$\text{P.E.}_{\text{estim. } X_1} = 0.6745 \times 17.49\sqrt{1 - 0.799^2} = 7.1$,
$\text{P.E.}_{\text{estim. } X_2} = 7.3$.

The P.E. or S.E. of an estimated score has a similar meaning to
those already considered, as may be shown by the following example.
Consider a very large group of persons whose Binet I.Q.s are 110,
and whose estimated Vocabulary I.Q.s are 108.1 ± 7.3. Then one
half of them should have Vocabulary I.Q.s lying between 115.4 and
100.8, and all but 1 in 100 should lie within a range of 2.58 × S.E.,
i.e. about 4 × P.E., or between 136 and 80.

Actually we already know the Vocabulary I.Q.s of 71 persons
with Binet I.Q.s round 110. This is not a very large group, but it
should serve to test out our prediction. The scattergram is too coarse,
but looking back to the original scores it appears that 37 of them,
or 52%, had Vocabulary I.Q.s between 115 and 101 (inclusive), and
that the two extreme Vocabulary I.Q.s were 138 and 80. This
agreement is quite close.

It is important to note that the P.E. of an estimated score
decreases as r increases. If the correlation were perfect there would
be no error at all in estimating $X_1$ from $X_2$, or $X_2$ from
$X_1$. The lower the correlation, the less accurate will be the
predictions.


Ex. 52.—As an additional illustration, take the results from the
36 history and geography papers. Here N is far too small for accurate
regression lines to be obtained by averaging rows or columns. But
by the use of the formulae given above, probable geography marks
can be estimated from history marks, or vice versa. For instance:
what geography mark ($X_2$) corresponds to a history mark ($X_1$) of 7,
and what is its P.E.?

    $M_1 = 14.139$          $M_2 = 9.945$
    $\sigma_1 = 3.903$      $\sigma_2 = 4.819$
    $r = +0.579$

$$(X_2 - 9.945) = 0.579\,(7 - 14.139) \times \frac{4.819}{3.903}$$

$$X_2 = 9.945 - 5.103 = 4.84$$

$$\text{P.E.}_{\text{estim. } X_2} = 0.6745 \times 4.819\sqrt{1 - 0.579^2} = 2.65$$

Therefore $X_2 = 4.84 \pm 2.65$.

Here the correlation is only moderate, hence the estimate is poor
in reliability. All we can claim is that a pupil with a history mark of
7 would be very unlikely to get a geography mark as high as 15
($X_2 + 4 \times$ P.E.), and would most probably get round about 2 to 7.
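The whole of Ex. 52, estimate and probable error together, can be reproduced as follows. This is a sketch only: the variable and function names are hypothetical, and the constant 0.6745 is the usual factor converting a standard error into a probable error, as used throughout the text.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard error


def probable_error_of_estimate(sd_target, r):
    """P.E. of a score estimated from a correlated variable."""
    return PE_FACTOR * sd_target * math.sqrt(1.0 - r * r)


# Figures of Ex. 52: X1 = history marks, X2 = geography marks.
m_hist, m_geog = 14.139, 9.945
sd_hist, sd_geog = 3.903, 4.819
r = 0.579

estimate = m_geog + r * (sd_geog / sd_hist) * (7 - m_hist)
pe = probable_error_of_estimate(sd_geog, r)
upper_limit = estimate + 4 * pe  # exceeded only in rare cases

print(round(estimate, 2), round(pe, 2))  # 4.84 2.65
```

The variable `upper_limit` comes out a little above 15, in keeping with the remark that a geography mark as high as 15 would be very unlikely.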


INTERPRETATION OF CORRELATIONS 

In the light of the above discussion it is possible to give a much
more precise meaning to the conception of correlation than heretofore.
A correlation does not mean, as is sometimes supposed, the
percentage agreement between the correlated variables. In the
scattergram on p. 113, for example, only 37½% of the testees get the
same I.Q., or rather the same class of I.Q., on Binet and Vocabulary,
although the correlation is +0.799. It is possible to interpret a
coefficient in terms of the number of factors or elements which are
common to the two variables, which therefore cause them to correlate
with one another, but there is little practical point in so doing.
What a correlation does indicate is the amount of reduction of error
in predicting scores on one variable from scores on the other variable.
When r is zero, the error is at its maximum, namely $0.6745\,\sigma$.
Predictions based on $X_1$ might fall anywhere within the whole range
of $X_2$ scores, and would be no more accurate than drawing $X_2$
scores out of a hat. With a correlation of 1.0 the error is reduced to
zero. The extent of reduction of error for intermediate r's is shown in
the following table. The second column lists values of
$100\,(1 - \sqrt{1 - r^2})$, and so represents the reduction in
percentage terms. These figures are sometimes referred to as the
forecasting efficiencies of the r's. It will be seen that the
forecasting efficiency only becomes high enough to be of great
practical value when r is in the neighbourhood of 0.8 to 0.9 or more.
Great caution should be exercised therefore before predicting, say,
success at a job from aptitude tests, since these usually give
correlations of only +0.4 to +0.6 with what we wish to predict. Such
deductions are only 8% to 20% more accurate than pure chance guesses.

      r       Reduction of error, or
              forecasting efficiency %

    0.00            0.0
    0.10            0.5
    0.20            2.0
    0.30            4.6
    0.40            8.4
    0.50           13.4
    0.60           20.0
    0.70           28.6
    0.80           40.0
    0.90           56.4
    0.95           68.8
    1.00          100.0
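The entries of the table follow from the single expression $100\,(1 - \sqrt{1 - r^2})$, so they can be regenerated for any value of r. The short Python sketch below does this; the function name `forecasting_efficiency` is hypothetical.

```python
import math


def forecasting_efficiency(r):
    """Percentage reduction of the error of estimate: 100 * (1 - sqrt(1 - r^2))."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))


for r in (0.00, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00):
    print(f"{r:4.2f}  {forecasting_efficiency(r):6.1f}")
```

The slow rise of the figures at low r, and their rapid rise only above 0.8, is the point of the warning about aptitude tests.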


An alternative method of interpretation is suggested by McClelland
(1942). Suppose we are selecting the top 20 of a group of 100
candidates, either for secondary schooling or for a job, on the basis
of their intelligence test or selection battery scores. We know that
the test gives a correlation of r with later success. What proportion
of candidates is correctly allocated, or how many do we wrongly select
and wrongly reject? The position for various levels of r is shown in
the following tables.

                    Successful    Unsuccessful
                       later          later        Total
    r = 1.00
      Pass test         20              0            20
      Fail test          0             80            80
      Total             20             80           100

    r = 0.00
      Pass test          4             16            20
      Fail test         16             64            80
      Total             20             80           100

    r = 0.85
      Pass test         14              6            20
      Fail test          6             74            80
      Total             20             80           100

In the first table all the best 20 on the test turn out well and the
bottom 80 badly; correlation is perfect. In the second, r is zero and
our selection is no better than picking names out of a hat, since as
large a proportion of the 20 who pass do well later as of the 80 who
fail the test (namely 20% of each). Nevertheless, 64 and 4 are
correctly allocated, and the proportion of errors is 32%. The third
situation represents the typical situation in secondary school
selection, where the battery of intelligence and attainments tests
correlates about 0.85 with later school work. The number of correct
allocations is now 88, and of errors 12%. (Note, however, that nearly
one-third of the 20 Passes who actually get into the grammar school
gain unsatisfactory marks later.) Thus we have reduced errors by
$\frac{32 - 12}{32} = 62\tfrac{1}{2}\%$. Here are the corresponding
reductions for other levels of r.

      r       Reduction of errors
              in prediction %

    0.00             0
    0.10             6
    0.20            12
    0.30            17
    0.40            23
    0.50            30
    0.60            37
    0.70            46
    0.80            57
    0.90            70
    0.95            80
    1.00           100
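The reduction of errors is simple to compute from any such fourfold table. The sketch below takes the three tables given for r = 1.00, 0.00 and 0.85; the function names and the tuple layout are hypothetical conveniences, not part of McClelland's method.

```python
def errors(table):
    """Wrong allocations in a table laid out as
    ((pass_successful, pass_unsuccessful), (fail_successful, fail_unsuccessful))."""
    (_, wrongly_selected), (wrongly_rejected, _) = table
    return wrongly_selected + wrongly_rejected


def reduction_of_errors(table, chance_table):
    """Percentage by which errors fall below those of pure chance selection."""
    chance_errors = errors(chance_table)
    return 100.0 * (chance_errors - errors(table)) / chance_errors


chance = ((4, 16), (16, 64))    # r = 0.00
typical = ((14, 6), (6, 74))    # r = 0.85
perfect = ((20, 0), (0, 80))    # r = 1.00

print(reduction_of_errors(typical, chance))  # 62.5
```

Because the group numbers exactly 100, the count returned by `errors` is also the percentage of candidates wrongly allocated.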


The Influence upon Correlations of Irrelevant Common Factors.— 
A positive correlation which is statistically significant (i.e. 3 or 
preferably 44 times its P.E.) always indicates some kind of causal 
connection, or some common factor or factors running through the 
two variables—even when the connection is not strong enough for us 
to be able to forecast one variable from the other. Considerable care 
is needed, however, in interpreting such a connection, for its real 
nature may be quite unsuspected, or the common factors may be of 
an entirely irrelevant kind. Suppose, for example, that we correlated 
the size of boots worn by all the children in a school with the speed 
of their handwriting, we should undoubtedly obtain a moderately 
high coefficient. Yet it would be absurd to say that big boots cause 
quick handwriting, or vice versa. The real reason for the correlation 
is an irrelevant common factor, in this case age. The older children 
on the whole write more quickly and have larger feet than the 
younger ones. If we took children all of the same age and compared
boots with handwriting, we should be likely to find that the
correlation had fallen to zero.

Consider another, rather more complex, illustration. A positive
correlation exists between backwardness in school work and un- 
employment among the children’s parents. Idealists may at once 
jump to the conclusion that unemployment brings about malnutri- 
tion and anxiety among the children, and that these make their work 
fall off. But here also there is an important common element which 
goes some way to explain the connection. Unemployed parents are 
known to be on the whole less intelligent than employed ones, and 
unintelligent parents tend not only to have unintelligent children, 
but also to give them less educational encouragement than do in- 
telligent parents. These factors—low intelligence and generally un- 
favourable environment—are far more important causes of back- 
wardness than is malnutrition. We need to take two groups of chil- 
dren, one with parents employed, one unemployed, groups whose 
average intelligence level is identical, and then study their school 
work to see whether the former do better than the latter. Only when 
we have ‘held the intelligence factor constant’ in this example, or 
‘held the age factor constant’ in the previous example, can we 
legitimately deduce some causal connection. 

Partial Correlation.—The above type of problem or difficulty in
interpretation is constantly occurring in psychology and education.
Sometimes it can be met by applying what is known as the partial
correlation technique. For this it is necessary to obtain measures of
the suspected irrelevant factor, which we shall call $X_3$, and to work
out the correlations between $X_1$ and $X_3$ ($r_{13}$), and between
$X_2$ and $X_3$ ($r_{23}$), as well as between $X_1$ and $X_2$
($r_{12}$). We may then 'hold $X_3$ constant' or 'partial it out' by
the following formula, in which $r_{12.3}$ means the correlation of
$X_1$ with $X_2$ when $X_3$ is eliminated:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

Ex. 53.—Suppose the correlation between boots and handwriting
speed to be +0.45, that between boots and age +0.70, and between
handwriting and age +0.60. Then the correlation between boots
and handwriting when the influence of age is removed is:

$$r_{12.3} = \frac{0.45 - 0.70 \times 0.60}{\sqrt{1 - 0.70^2}\,\sqrt{1 - 0.60^2}} = +0.052.$$

1 Thouless (1939a) has pointed out that the partial correlation formula
can only legitimately be used if $X_3$ can be measured without error.
It is thus most appropriate for holding age constant, but not for, say,
intelligence; since measurements of the latter always involve some
error. However, Thouless provides an alternative formula which may be
used when the reliability of $X_3$ is known.
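The partial correlation formula is easily put into code. The following Python sketch uses the figures of Ex. 53; the function name `partial_correlation` is hypothetical.

```python
import math


def partial_correlation(r12, r13, r23):
    """Correlation of X1 with X2 when X3 is held constant."""
    numerator = r12 - r13 * r23
    denominator = math.sqrt(1.0 - r13 ** 2) * math.sqrt(1.0 - r23 ** 2)
    return numerator / denominator


# Ex. 53: boots (1), handwriting speed (2), age (3).
r_boots_writing = partial_correlation(r12=0.45, r13=0.70, r23=0.60)
print(f"{r_boots_writing:.4f}")
```

The result is a little over 0.05, so that once age is held constant practically nothing remains of the original coefficient of +0.45.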

The Effect of Homogeneity and Heterogeneity upon Correlations. — 
We have just seen that when pupils are heterogeneous as regards age, 
the correlation between a pair of their characteristics may be 
spuriously raised. The same is true of other types of heterogeneity. 
Suppose we calculate the correlation between two tests in an ordinary
school class, then eliminate from consideration most of the
pupils of medium ability on both tests, and re-calculate the
correlation: the second figure will certainly be larger than the first.
The distributions will not, of course, be normal; the diversity or
heterogeneity of the pupils will be unduly large, and this will increase
the size of r. In ordinary practice the reverse, namely undue homogeneity,
is more likely to occur without the investigator realizing it. Thus the
correlations between intelligence tests and school work in an ordinary
school class are usually lower than they would be in an unselected 
group of children of the same age. For such a school class is more 
homogeneous as regards intelligence and school work than is the 
population as a whole. The dullest children of that age are likely to 
be at a special school, and many of the brightest ones may be at 
private schools. Take an extreme case: if we selected a group of 
children all of precisely the same intelligence, any correlation be- 
tween intelligence and work would disappear. One important reason 
why intelligence tests appear to be of much less value for predicting 
scholastic aptitude among secondary grammar than among primary 
pupils, and to be poorer still among university students, is that 
secondary pupils are more homogeneous or more highly selected 
than primary. Most of the children of I.Q. less than 110 are weeded 
out before the grammar school is reached, and university entrance re- 
quirements remove many more from the middle and lower end of the 
scale. Thus correlations necessarily sink as we pass from unselected
children to the primary school, from the primary to the grammar,
and from the grammar to the university level. The typical correlation
of +0.85 which we have quoted for selection tests at 11+ with
later school work drops to about +0.45 within a grammar school
group.


It is not easy to decide just what degree of heterogeneity should be
regarded as normal and reasonable, nor to specify the extent to
which any given group of persons exceed or fall short of this degree
of heterogeneity. Hence the correction of correlations either for
undue homogeneity or excessive heterogeneity is complicated. The
reader may be referred to discussions by Thomson (1939) and
Thorndike (1949).

The main conclusion to be drawn from the above sections is that
the investigator should always scrutinize his correlational data very 
carefully, first in order to see whether any irrelevant common factors 
may be increasing spuriously the size of his coefficients, secondly to 
judge whether or not the diversity of scores in the group tested is 
unusually wide or restricted. He should realize that the absolute size 
of a correlation means very little, since this size may be so much 
altered by uncontrolled common factors or heterogeneity. His inter- 
pretation of the amount of relationship between the correlated 
variables should be made relative to the particular data with which 
he is working. 

In contrast to the absolute size, the predictive value or forecasting 
efficiency of a correlation is not affected by heterogeneity. For the 
latter, as we have seen, depends on $\sigma\sqrt{1 - r^2}$. Hence an increase in
r on account of excessive heterogeneity is offset by the increase in $\sigma$,
and the P.E. remains the same. It should be noted further that fore- 
casting efficiency is not in the least upset by irrelevant common 
factors. True, it might be foolish to predict speed of handwriting from 
the size of boots, yet the correlation obtained between these variables 
would provide a valid means of making such predictions. 


COMBINING TWO OR MORE TESTS 
The educational statistician is frequently faced with the problem 
of combining the results of several examinations and tests, and using 
the total scores for the selection of the best pupils. We have already 
dealt rather fully with the point that the distributions to be combined 
should be expressed in equivalent units, i.e. that they should possess 
the same means and S.D.s, except when it is desired to give more 
weight to one set of marks than to another, in which case the former 
should have the larger S.D., not necessarily the larger mean. Here
we must draw attention to an important fact which markers seldom 
realize, namely that the S.D. or dispersion of combined variables is 
always less than that of the separate variables by an amount depend- 
ing on the lowness of correlation between these variables. This fact 
is a natural corollary of the facts of regression outlined above. 
Suppose, for example, that total marks are based on the average
of six examinations, each with a mean mark of 60, a range of 30 to
90, and a S.D. of 10. If the average correlation between the six
variables is +0.70, the final distribution will probably have a S.D.
of 8.66 and the average totals will range from 34 to 86. But if the
average inter-correlation is only +0.30, the final distribution will
have a S.D. of 6.45, and the range will drop from 30-90 to 41-79. It
is easy to see why this must be so. When the average inter-correlation
is moderate or low, no pupil or student will get very high marks, or
very low ones, on all the examinations. Hence the highest and lowest
average marks will be distinctly nearer to the mean than the highest
and lowest marks on each separate examination. The same phenomenon
occurs when marks on different questions in a single examination
are totalled. For instance, if five questions are marked, each
with a range of 5-20 out of 20, the examination totals will probably
range from about 40-85, and not from 25-100.

A chief examiner, head teacher, or other marking authority should 
make allowance for these facts. If he wishes the distribution of final 
totals to show a certain proportion of very high (over 80%) marks, 
and a certain proportion of low (under 40%) marks, then he must 
either direct the markers of the component examinations to employ 
extra-wide distributions (i.e. to award much larger proportions of 
marks over 80% and under 40%), or else he must re-scale the final 
marks so as to spread them out again into a distribution with the 
required dispersion. The formula connecting $\sigma$, the S.D. of the
component distributions, and $\sigma_{av.}$, the S.D. of these
distributions averaged, is:

$$\sigma_{av.} = \sigma\sqrt{\frac{1 + (n - 1)R}{n}}$$

where R is the average correlation between all the sets of marks, and
n the number of sets.¹
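The shrinkage of the S.D. when marks are averaged can be worked out from this formula for any number of examinations. The Python sketch below applies it to the example of six examinations with a S.D. of 10; the function name `sd_of_average` is hypothetical, and the formula presumes that all the component distributions have the same S.D.

```python
import math


def sd_of_average(sd, n, mean_r):
    """S.D. of the average of n sets of marks, each with S.D. `sd`,
    whose average inter-correlation is `mean_r`."""
    return sd * math.sqrt((1.0 + (n - 1) * mean_r) / n)


sd_high = sd_of_average(sd=10, n=6, mean_r=0.70)
sd_low = sd_of_average(sd=10, n=6, mean_r=0.30)

print(round(sd_high, 2))  # 8.66
print(round(sd_low, 2))
```

For an average inter-correlation of +0.30 the formula gives a S.D. of roughly 6.45, and three S.D.s either side of the mean of 60 then reproduce the range of 41 to 79.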

Multiple Correlation.—If two tests or examinations each correlate
moderately with a third, then the two combined will usually correlate
higher with the third than either separately. For example, English
and arithmetic examinations at 11+ usually yield correlations of
about +0.40 with subsequent achievement in grammar schools. The
combined mark is likely to correlate about +0.45 with such
achievement. Every educationist has realized and applied this principle.
With the aid of statistical methods we can formulate it more precisely.

Let there be n tests which give correlations $r_{1A}$, $r_{2A}$,
$r_{3A}$ ... etc., with some measure of achievement, A, the sum of all
these coefficients being $\Sigma r_{tA}$; and let the correlations of
the tests with one another be $r_{12}$, $r_{13}$, $r_{23}$ ... etc.,
their sum being $\Sigma r_{tt}$. Then the correlation of A with the
combined tests is:

$$r_{A(1+2+\dots+n)} = \frac{\Sigma r_{tA}}{\sqrt{n + 2\,\Sigma r_{tt}}}$$

1 Alternatively, $\sigma_{sum}/n$ may be substituted for
$\sigma_{av.}$, where $\sigma_{sum}$ is the S.D. of the total scores.
The same formula, turned around, is often useful for finding the
average inter-correlation of n variables when their separate and
combined S.D.s are known. Cf. Kelley (1924), Formula No. 171.

Ex. 54.—Three different types of intelligence tests gave correlations
of +0.483, +0.456, and +0.337 with students' marks in psychology.
$\Sigma r_{tA} = 1.276$. Their inter-correlations were +0.507,
+0.468, and +0.508. $\Sigma r_{tt} = 1.483$. What will be the
correlation of the combined tests with psychology marks?

$$r_{A(1+2+3)} = \frac{1.276}{\sqrt{3 + 2 \times 1.483}} = +0.522.$$
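Ex. 54 may be verified with the short Python sketch below. The function name `correlation_with_sum` is hypothetical, and the formula presumes that the tests are simply added together with equal weight after being put into equivalent units.

```python
import math


def correlation_with_sum(criterion_rs, inter_rs):
    """Correlation of a criterion A with the unweighted sum of n tests.

    criterion_rs: correlation of each test with A (n values).
    inter_rs: correlation of each pair of tests (n(n-1)/2 values).
    """
    n = len(criterion_rs)
    if len(inter_rs) != n * (n - 1) // 2:
        raise ValueError("expected one inter-correlation per pair of tests")
    return sum(criterion_rs) / math.sqrt(n + 2.0 * sum(inter_rs))


r_combined = correlation_with_sum([0.483, 0.456, 0.337], [0.507, 0.468, 0.508])
print(round(r_combined, 3))  # 0.522
```

The combined figure exceeds the best single coefficient of +0.483, but only modestly, since the three tests overlap a good deal with one another.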


The above method is a simple but crude way of combining tests
so as to give better predictions of achievement. Naturally some of
the tests are superior to others, and so should receive more weight
in the total score. Further, the most useful tests will be the ones which,
as well as correlating highly with A, do not correlate highly with the
other tests that are being employed. For obviously, if tests 1 and 2
are measuring almost precisely the same thing, they will predict, as
it were, the same aspect of A. Whereas what we want are tests which
predict different aspects, and so between them cover the whole of A
more thoroughly. Thus it may be better to leave out test 2, and
substitute another which, while perhaps correlating less well with A,
has the advantage of being almost independent of test 1. The method
known as multiple correlation enables the educationist or psychologist
to choose from a number of tests those which will in combination
give the highest possible correlation with A, and to weight them
suitably. Multiple regression equations can also be calculated which
predict an individual's most probable score on A by combining his
marks on the separate tests in appropriate proportions. The method
is too elaborate to be given here, but the above discussion will, it is
hoped, indicate its object and main principles.

The vocational psychologist applies an analogous procedure in
selecting the best candidates for a certain occupation, that is, he
gives a series of tests each of which has been found to correlate
moderately with success at the occupation, and combines them by
the multiple correlation method so as to obtain as accurate
predictions as possible.


TEST RELIABILITY


If a test or examination is applied a second time under similar
conditions, and the testees’ scores differ widely from those previously 
obtained, the test is obviously a poor one. It is said to be reliable 
only if the two sets of scores correlate highly with one another. 
Further, if different testers apply and score a test or examination, 
they should arrive at the same, or nearly the same, scores. Repetition 
of a test may, however, give an unfair picture of its reliability, since 
the testees may remember their previous responses. Many tests are 
therefore supplied in two or more parallel forms, so that if a re-test 
is desired, different questions may be set which should, nevertheless, 
yield much the same results as did the questions of the first test. 
When no alternative form is available, a single test is often split into
two equivalent halves. For instance, the scores on odd- and on even- 
numbered questions or items may be totalled separately, and then 
inter-correlated. But it is known that test reliability depends on the 
length of the test, hence the correlation between one half and the 
other half is unduly low, and is usually corrected by the formula: 


$$R = \frac{2r}{1 + r}$$

Here r is the obtained coefficient, and R the coefficient
to be expected had it been possible to compare the whole of the test
with another similar test.

In all these instances—namely repetition, application and scoring 
by a different tester, parallel form, and corrected split-half—we 
generally expect to get an inter-correlation of at least +0.90. For if
this reliability coefficient is much lower it would indicate that the
scores are too unstable to be trusted. Standardized educational and
intelligence tests generally reach this level of reliability, but ordinary 
examinations often fail to do so, as we shall see in Chap. XI. Many 
performance tests, tests of occupational abilities, and character or 
temperament tests, are also often poor in reliability. However, by 
increasing the length of a test, or by combining several parallel tests, 
a more satisfactory coefficient may be attained. 

Knowing the reliability coefficient, it is possible to calculate the
reliability, or the limits of variation, of individual scores, by the
formula already given:
$\text{P.E.}_{\text{estim. } X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2}$.
For example, if parallel intelligence tests give a correlation of
+0.93, and the S.D. of their I.Q.s is 15, then
$\text{P.E.}_{\text{I.Q.}} = 0.6745 \times 15\sqrt{1 - 0.93^2} = 3.71$.
But if the reliability coefficient is only +0.70,
$\text{P.E.}_{\text{I.Q.}} = 7.21$, that is almost twice as large. Such
a P.E. is, as we have already seen, an index of the extent to which
scores on the second test may vary when predicted from scores on the
first test. About half the testees will obtain I.Q.s on the second test
within 4 points (if r = 0.93), or about 7 points (if r = 0.70), of
their I.Q.s on the first test. But rare cases (1 in 100) may alter by
as much as 4 × P.E., i.e. by some 15 and 29 or more points respectively.

The application of two parallel forms of a test to the same indi- 
vidual yields, as we have seen, somewhat different results. If we 
possessed a large number of additional parallel forms and applied 
them, still other results would be obtained (quite apart from effects 
of practice). Such results for one individual would tend to conform to 
a normal distribution, and their grand average would, of course, be 
the best possible measure of the individual's standing on the test— 
what may be called his true score. The situation is exactly analogous 
to that which we discussed in connection with group differences 
(Chap. V), except that there we were concerned with variations 
among the means of sample groups, here with variations among an 
individual’s scores on sample tests. An individual's score therefore 
possesses another P.E., which indicates its liability to deviate from 
his true score. The formula for this also involves the reliability of 
the test, and the S.D. of the test scores of an unselected group of
persons: $\text{P.E.} = 0.6745\,\sigma\sqrt{1 - r}$. Thus the P.E.s of
I.Q.s for tests with reliability coefficients of +0.93 and +0.70 are
2.67 and 5.53 respectively. Note that these P.E.s are smaller than
those quoted above, since $\sqrt{1 - r}$ is necessarily smaller than
$\sqrt{1 - r^2}$. This is to
be expected, since the former P.E. represents the deviation of one 
sample score from another sample, whereas the latter represents the 
deviation from the mean of all the possible samples. 
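The two probable errors just distinguished may be set side by side in code. In the Python sketch below the function names are hypothetical, and the S.D. of 15 is that of the I.Q. example in the text.

```python
import math

PE_FACTOR = 0.6745  # probable error = 0.6745 x standard error


def pe_between_forms(sd, reliability):
    """P.E. of a score on one form as predicted from a parallel form."""
    return PE_FACTOR * sd * math.sqrt(1.0 - reliability ** 2)


def pe_from_true_score(sd, reliability):
    """P.E. of an obtained score about the individual's true score."""
    return PE_FACTOR * sd * math.sqrt(1.0 - reliability)


for reliability in (0.93, 0.70):
    between = pe_between_forms(15, reliability)
    true = pe_from_true_score(15, reliability)
    print(f"r = {reliability:.2f}  between forms {between:.2f}  from true score {true:.2f}")
```

The computed figures agree with those quoted in the text to within about 0.02 of a point of I.Q., the small differences presumably arising from rounding at intermediate steps.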

Factors Influencing Test Reliability.—Since the reliability of a test 
is almost synonymous with its thoroughness, it can be increased or 
lowered almost indefinitely by lengthening or shortening the test. 
The formula for predicting the reliability of a test whose length is 
doubled has already been cited. When it is lengthened n times, the
formula becomes:

$$R = \frac{nr}{1 + (n - 1)r}$$

This is known as the Spearman-Brown prophecy formula.

Ex. 55.—A short educational test has a reliability coefficient of
+0.60. How much must it be lengthened to attain a coefficient of
+0.90? Turning the above formula round, we get:

$$n = \frac{R(1 - r)}{r(1 - R)} = \frac{0.90 \times 0.40}{0.60 \times 0.10} = 6$$

It should be six times as long.

The same calculations would apply to the following example.
If two examiners mark a set of examination scripts, and their marks
inter-correlate +0.60, how many examiners should mark the scripts
for their combined marking to attain a reasonable reliability of
+0.90? The answer is six.


The formula assumes that the correlations between the additional 
tests (or examiners) would be the same as between the first pair of 
tests (or examiners). This condition seems generally to be satisfied 
by educational measurements. 
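
The two worked examples above (the lengthened test and the panel of examiners) can be reproduced with a short sketch. The function names are invented for illustration; the formulas are the Spearman-Brown formula and its inversion for n as used in Ex. 55.

```python
def spearman_brown(r: float, n: float) -> float:
    """Predicted reliability when a test of reliability r is lengthened n times."""
    return n * r / (1.0 + (n - 1.0) * r)

def lengthening_needed(r: float, target: float) -> float:
    """The formula turned round: the n required to raise reliability r to target."""
    return target * (1.0 - r) / (r * (1.0 - target))

n = lengthening_needed(0.60, 0.90)
print(f"lengthening factor (or number of examiners) needed: {n:.2f}")
print(f"reliability predicted for n = 6: {spearman_brown(0.60, 6):.2f}")
```

As in the text, the sketch assumes that the added parts (or examiners) correlate with one another to the same extent as the first pair.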

A second important factor is the heterogeneity of the group whose
scores provide the reliability coefficient. Parallel forms of an
intelligence test will inter-correlate much more highly if applied to
school-children of several years’ age range than when applied to a
single class or to a single year-group (i.e. children whose ages range
over only one year). Coefficients obtained from the latter groups are
conventionally accepted as fairer indices of reliability than
coefficients for the former. If we know σ, the S.D. of test scores in a
group which is unduly homogeneous or heterogeneous, then it is
possible to predict R, the reliability to be expected in a group with
S.D. Σ, by Kelley's formula: $R = 1 - \frac{\sigma^2}{\Sigma^2}(1 - r)$.


Ex. 56.—The reliability coefficient of an intelligence test when
applied to a class of primary school-children was + 0.88. The S.D.
of their I.Q.s was 14. What would be its reliability in a secondary
school class whose I.Q.s have a S.D. of only 9?

$R = 1 - \frac{14^2}{9^2} \times 0.12 = + 0.71$

Note that the P.E. of an individual’s score is not upset by this
factor, as is the reliability coefficient. In the above example it works
out at ± 3.27 in either class.
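
A brief sketch may make both points of Ex. 56 concrete: the change in the reliability coefficient and the constancy of the P.E. The function and variable names are invented for illustration; the figures are those of the example.

```python
import math

def kelley_reliability(r: float, sd_observed: float, sd_target: float) -> float:
    """Kelley's formula: reliability expected in a group whose S.D. is sd_target,
    given reliability r in a group whose S.D. is sd_observed."""
    return 1.0 - (sd_observed ** 2 / sd_target ** 2) * (1.0 - r)

def probable_error_of_score(sd: float, reliability: float) -> float:
    """P.E. of an individual's score: 0.6745 * sd * sqrt(1 - r)."""
    return 0.6745 * sd * math.sqrt(1.0 - reliability)

r_primary, sd_primary, sd_secondary = 0.88, 14.0, 9.0
r_secondary = kelley_reliability(r_primary, sd_primary, sd_secondary)

pe_primary = probable_error_of_score(sd_primary, r_primary)
pe_secondary = probable_error_of_score(sd_secondary, r_secondary)

print(f"reliability in the secondary class: {r_secondary:.2f}")
print(f"P.E. in the primary class:   {pe_primary:.2f}")
print(f"P.E. in the secondary class: {pe_secondary:.2f}")
```

The two P.E.s coincide because the product of the S.D. and the square root of (1 - reliability) is unchanged by Kelley's formula.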

The assessment of test reliability and its implications are complex
topics, and the account given here covers only the more elementary
points. Advanced students are advised to consult Thorndike (1949)
and Gulliksen (1950).


Function Fluctuation.—Most reliability coefficients fall below 1.0
for two distinct reasons: first, through the test being insufficiently
thorough (i.e. liable to errors of sampling), and, secondly, because the
persons tested or examined may alter in their level of achievement
between one test and another. The latter factor has been termed
‘individual variance’ or ‘function fluctuation’ (Thouless, 1936).
Naturally it has a greater effect on coefficients obtained from parallel 
tests given on different occasions, or from repeated tests, than it 
does on coefficients obtained by the split-half method. Hence the 
difference between these two types of coefficient affords a means of 
distinguishing between the reliability of the test itself, and the ten- 
dency of testees to fluctuate in respect of the ability tested. Thouless 
provides a formula for estimating function fluctuation from this 
difference. 

Effects of Test Reliability on Test Inter-correlations.—A test which 
is not perfectly reliable may be regarded as measuring so much of the 
function which a perfect test would measure, and so much error. 
The error part of it is quite useless, and will have the effect of re- 
ducing the correlation between this test and any other test. If 
both tests possess considerable errors through lack of reliability, 
their inter-correlation will be considerably reduced. In fact, the 
maximum possible value of such an inter-correlation will be, not 1.0,
but the square root of the product of their reliability coefficients.
Suppose we find a correlation of + 0.50 between a selection
examination and subsequent achievement, and that the reliabilities of
the examination and the measure of achievement are only + 0.70
and + 0.80. Then the true correlation, if it were possible to obtain
perfectly reliable exams and measures of achievement, would be

$\frac{0.50}{\sqrt{0.70 \times 0.80}} = + 0.67$

Spearman has called this reduction effect of imperfect reliability
‘attenuation,’ and has provided formulae by which it may be corrected
or compensated. Such correction is, of course, of theoretical
interest, since it tells us what correlations to expect between various
psychological abilities if they could be accurately measured. But it is
of little practical value, since completely accurate tests are
unattainable, and for all purposes of prediction we have to employ the
obtained, uncorrected, correlation coefficients (cf. Thouless, 1939a).
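
The arithmetic of the selection-examination example can be set out as a short sketch. The function names are invented for illustration; the correction divides the obtained correlation by the square root of the product of the two reliabilities, which is the ceiling described in the passage.

```python
import math

def max_possible_correlation(rel_x: float, rel_y: float) -> float:
    """Ceiling on the correlation between two imperfectly reliable measures."""
    return math.sqrt(rel_x * rel_y)

def correct_for_attenuation(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correlation to be expected if both measures were perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)

ceiling = max_possible_correlation(0.70, 0.80)
corrected = correct_for_attenuation(0.50, 0.70, 0.80)
print(f"maximum obtainable correlation: {ceiling:.3f}")
print(f"correlation corrected for attenuation: {corrected:.3f}")
```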


CHAPTER VIII 
ANALYSIS OF ABILITIES 


Fallacious Views of Mental Organization.—Correlational investiga- 
tions have assumed especial importance in recent years, since they 
enable us to clarify our conceptions of human abilities, and of the 
organization or structure of the mind. The layman’s notions as to 
what abilities exist, and what each ability includes or excludes, are 
extremely loose, and the same was true of psychological theory 
throughout most of last century. Teachers and parents are heard to 
remark: “Johnny is poor at book-learning, but he makes up for it 
by his cleverness with his hands. He will be a good mechanic when 
he grows up.” “Mary has an excellent memory and good power of
attention.” “Willie is slow but sure,” and so on. Such statements are
now known, as a result of experimental investigation, to be largely 
fallacious. Only exact study can tell us how far the slow person is 
likely to be sure, whether manual ability in a boy has any predictive 
value for mechanical ability in a man, whether there is any such 
entity in the mind as memory. Again, the so-called faculty school of 
psychologists of a hundred years ago considered the mind to be 
made up of a number of distinct powers or faculties such as reason- 
ing, judgment, imagination, etc. Indeed, it was at one time regarded 
as the main object of education to train each of these faculties by 
providing children with appropriate ‘mental gymnastics.’ As soon, 
however, as scientific psychologists tried to define the faculties pre- 
cisely and measure them, and to determine the effects upon them of 
various types of training, they fell into discredit. A priori analysis has
not even been able to decide what is the true nature of intelligence. 
The accounts of it given by different writers have been extremely 
varied and discrepant. Nowadays, therefore, such problems, both of
theory and practice, are approached by means of experimental
studies of correlations between tests. For they all eventually resolve
into the question—what correlates with what?


The Scientific Definition of an Ability.—First we must define what
we mean by an ability, capacity, or faculty. It implies the existence 
of a group or category of performances which correlate highly with
one another, and which are relatively distinct from (i.e. give low
correlations with) other performances. Take, for example, mechan- 
ical ability. Some people are better than others at tasks involving 
manipulation of mechanisms, and this ability is fairly consistent or 
general in the sense that those who are good at one such task are also 
usually good at others. The consistency is not perfect. A may be 
especially good with meccano, less clever with locks, B the opposite, 
C the best with electrical fittings, and so on. But if there was no 
appreciable correlation between these and other similar perform- 
ances, if they were all found to be specific, we should not be entitled 
to regard mechanical ability as a real entity. Another important 
condition is that the performances should be fairly reliable. We do 
not expect perfect stability, but if people fluctuated wildly in their 
success at mechanical tasks from day to day, we should hardly 
recognize the existence of mechanical ability. There is a further possi- 
bility, namely, that the various performances may inter-correlate 
well, but that the correlations may be accounted for by some other 
common factor such as general intelligence (the age factor, we will 
assume, has already been eliminated). Mechanical ability would not 
then be anything distinctive. However, intelligence tests can be 
applied to the same persons who take the mechanical tests, and 
intelligence can be partialed out or held constant. It is then found 
that the mechanical tests still overlap, or show positive correlations 
with one another over and above their correlations due to the intelli- 
gence factor. We are, therefore, able to accept mechanical ability as 
something consistent and distinctive. Tests of it do correlate
sufficiently well and reliably for us to postulate some common element
running through them.
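
The step of holding intelligence constant can be sketched with the usual first-order partial correlation formula. The passage does not write the formula out, and the correlations below are hypothetical figures chosen for illustration, not results from any study cited here.

```python
import math

def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """Correlation between variables 1 and 2 with variable 3 held constant."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))

# HYPOTHETICAL figures: two mechanical tests correlate 0.50 with each other,
# and each correlates 0.40 with an intelligence test.
residual = partial_correlation(0.50, 0.40, 0.40)
print(f"correlation with intelligence partialed out: {residual:.3f}")

# If the two tests had correlated only 0.16 (= 0.40 * 0.40), intelligence alone
# would account for their overlap and nothing distinctive would remain.
print(f"overlap due to intelligence alone: {partial_correlation(0.16, 0.40, 0.40):.3f}")
```

A clearly positive residual, as in the first case, is what entitles us to speak of a distinct mechanical ability.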

A common element such as this is often referred to as a group
factor, since it occurs in a group of performances of a certain
restricted type. It differs from a general factor like intelligence,
which is found to run through an extremely wide range of tests, which
indeed enters to some extent into all abilities.

Since the consistency or overlapping of different mechanical tests
is not perfect, no one test can give a really adequate measurement of
mechanical ability. But by combining several tests we are likely to
get a result much more representative of the group factor as a whole.
The same applies, of course, in the measurement of educational
abilities. We do not expect long-division sums alone to tell us how
good pupils are at arithmetic, although they provide some indication
of the ability. Instead we set several types of sums, and try to cover
the whole field of arithmetic which the pupils are supposed to know.

Arithmetic is likely to yield a group factor similar to the mechani- 
cal ability factor. But many of the faculties assumed by psychologists 
in the past, or by laymen at the present time, fail to stand up to the 
criteria enumerated above. Memory, for example, refers to so many 
different things, that it is meaningless to talk of so-and-so as having a 
good or bad memory in general, or of ‘training children’s memories.’
The speeds with which people learn various sorts of material, and
their retentiveness (i.e. the amounts of such material which they can
recall after an interval), give only moderate or low inter-correlations,
especially when the influence of intelligence is removed. Probably
there exist several small group factors, each representing memory
for a certain narrow range of material.¹ In other words, there may be
several different types, but no single entity, of memory.

The notion that mental speed differs from, and is usually opposed 
to, mental accuracy or power also appears to be fallacious. If there 
were two quite distinct abilities, then tests of speed should give high 
correlations with one another, and low or negative correlations with 
tests of accuracy or power. Under appropriate conditions of testing, 
difficult untimed tests and easy speeded tests do yield somewhat 
different results; thus a partial distinction is indicated. But they still 
correlate positively to a moderate or a high degree, showing that, on
the whole, those who are ‘slow’ at mental work are more likely to be
‘unsure’ than ‘sure.’

All Mental Abilities are Positively Inter-correlated.—Let us now turn 
to the wider problem—what abilities does the mind contain, and how 
are they organized or inter-related? A first, extremely important, fact 
is that all tests of mental abilities tend to give positive inter-correla- 
tions. Even tests of manual, physical, and other non-intellectual 
functions also usually correlate positively with one another and with 
mental tests, though the coefficients are often very small, whereas 
the coefficients among tests of intellectual functions are generally 
moderate to high. In selected groups, such as university students, 
occasional negative correlations may arise; but these are not typical, 
and they are seldom statistically significant. In other words, talent and
versatility, more often than not, go together; a person outstandingly
good or bad in one field is usually above or below average,
respectively, in all other fields. This fact at once serves to cast a great
deal of doubt on the ‘compensation theory,’ i.e. the view that those
who are poor at one thing make up for it by being good at something
else. It does not entirely disprove it, for the following reasons.

1 A summary of the evidence on this and other types of ability, obtained
between 1940 and 1950, is given in the writer’s book, The Structure of Human
Abilities (1950).

First the reader should study the scattergram for a very low
positive correlation, such as that shown in Fig. 26. This compares
the marks of students in an examination in hygiene with their
intelligence-test scores, and yields an r of + 0.146. It will be seen that,
though there is still a slight tendency for high scores on X₁ to be
associated with high scores on X₂, yet the correlation admits of many
big exceptions. Students who are among the highest 3% for
intelligence may range right down to the lowest 18% for hygiene. The
large amount of regression means that a fair proportion of persons
can be high on one variable, low on the other. Such a state of affairs
is certainly much more likely to occur here than it is when the
correlation is high (as in Fig. 23).

[Fig. 26.—Scattergram and regression lines illustrating a low correlation. Axis label recovered: Intelligence Test (X₂).]

Now ‘book-learning’ and ‘being clever with the hands’ are not
highly correlated, hence it is quite possible for some children to be
much better at one than at the other. But it is definitely false to
suppose that they are inversely related, that those who are bad at
one are therefore good at the other. On the contrary, those who are
superior in intellectual abilities are on the average superior in
practical ones also, though to a lesser extent.

A second reason is that age differences often obscure the issue.
Some teachers are incredulous when told that children of superior
intelligence tend to be superior also in growth and physical capacities
to dull children, since they know well that the dullards in an ordinary
school class are usually larger and stronger than the bright
youngsters. But, then, they are forgetting that the dullards are far older
than the bright ones, and so may be physically superior. If the dull
children were compared with bright ones of the same age, the
physical superiority of the latter would at once be obvious.
Nevertheless, in this instance also the correlations are so low that a great
many exceptions are possible. 

Thirdly, ability is obviously to some extent a matter of interest 
and practice. Pupils who are backward at intellectual studies and 
who are on the average backward, but much less so, in physical and 
manual pursuits, naturally devote more time and energy to such 
pursuits. Bright pupils, though initially stronger and better at prac- 
tical matters, may be more interested in intellectual matters, and so 
fall behind in other things. It is possible, therefore, for the differences 
that exist between abilities which give low inter-correlations to be- 
come accentuated, though there is still no evidence known to the 
writer that the correlations ever become negative. 

We may conclude, then, that all the abilities with which teachers 
are concerned are, to a greater or lesser extent, positively correlated. 
As an example, we will quote some of Burt’s (1917) results. He gave
thirteen objective tests of achievement in various school subjects to
120 children aged 11 +, and calculated the inter-correlations. The
following table shows the average correlation between each test and
the other twelve tests.*

Test                       Average correlation
Composition                + 0.505
History                    + 0.448
Geography                  + 0.456
Nature Study               + 0.458
Mechanical Arithmetic      + 0.283
Arithmetic Problems        + 0.457
Reading Fluency            + 0.323
Reading Comprehension      + 0.369
Writing Speed              + 0.323
Writing Quality            + 0.274
Dictation                  + 0.325
Drawing                    + 0.275
Handwork                   + 0.320

* The average reliability of the tests was + 0.738. All these coefficients would,
therefore, be larger by roughly 35% if the tests had been perfectly reliable (cf.
p. 129). A later repetition of the investigation with a much larger number of
children has yielded closely similar results (Burt, 1939a).

Composition, history, geography, nature study, and arithmetic
problems overlap with the other tests to the largest extent.
Mechanical arithmetic, drawing, and writing quality correlate to the smallest
extent. This shows that there is a tendency for a pupil’s achievements
in all subjects to run on a fairly even level, though the correlations
are low enough for some pupils to exhibit considerable divergences—
special strengths in some subjects, special weaknesses in others.
During the secondary school stage, correlations between subjects
decrease, partly perhaps because abilities tend to differentiate during
adolescence and specialised talents develop, but mainly because the
pupils form more homogeneous groups (cf. pp. 122-3). Nevertheless,
the positive overlap of all intellectual abilities continues even beyond
the university level. For example, the present writer inter-correlated
the marks of 300 graduate students at a training college, and obtained
small positive coefficients between subjects as widely different as
teaching skill, arithmetic, hygiene, education, psychology, speech
training, singing, physical training, etc. (cf. Vernon, 1939).
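
The footnote's figure of roughly 35% can be verified from the average reliability of Burt's tests. The sketch below assumes, for simplicity, that every test has the average reliability of 0.738 (individual reliabilities are not given here), so that the correction for attenuation reduces to dividing each coefficient by 0.738. The variable names are invented for illustration.

```python
# Average correlation of each test with the other twelve (Burt's table).
average_r = {
    "Composition": 0.505, "History": 0.448, "Geography": 0.456,
    "Nature Study": 0.458, "Mechanical Arithmetic": 0.283,
    "Arithmetic Problems": 0.457, "Reading Fluency": 0.323,
    "Reading Comprehension": 0.369, "Writing Speed": 0.323,
    "Writing Quality": 0.274, "Dictation": 0.325,
    "Drawing": 0.275, "Handwork": 0.320,
}

# ASSUMPTION: each test is given the average reliability quoted in the footnote.
MEAN_RELIABILITY = 0.738

# With equal reliabilities, sqrt(rel * rel) = rel, so the correction is r / rel.
corrected = {test: r / MEAN_RELIABILITY for test, r in average_r.items()}
proportional_increase = 1.0 / MEAN_RELIABILITY - 1.0

print(f"proportional increase: {proportional_increase:.1%}")
for test, r in sorted(corrected.items(), key=lambda item: -item[1]):
    print(f"{test:<24}{average_r[test]:.3f} -> {r:.3f}")
```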


These facts are of considerable importance to examiners. They 
indicate that a General Certificate in several subjects, achieved at the 
age of 15 to 18, is some criterion of aptitude for any intellectual 
occupation, or for a university career, albeit not a very good one. 
Similarly, the traditional method by which the Civil Service selected 
people for a wide variety of posts was a general examination based
largely on subjects studied at school or university. This implied that
the possession of ‘brains’ for one subject indicates aptitude for all
other subjects. Though this method was a fairly sound one, the
Civil Service has now come to realize that the efficiency of its
selection can be improved by adding tests of the aptitudes needed
for each type of post.

The General Factor Responsible for Positive Overlap.—Now when
we find positive association, as in Burt's investigation, we are justified 
in assuming the existence of a general factor—general educational 
ability—which runs through all the separate subjects. Some sub- 
jects may be said to be more ‘saturated’ with it than others, since they 
are more highly correlated with it. In the primary school we should 
get an excellent indication of this general factor from pupils’ marks 
in composition and in arithmetic problems, a much poorer indication 
of it from drawing or handwriting quality. Statistical methods are 
available for assessing a pupil’s standing on this general factor from 
his marks on all the component subjects (cf. p. 124). Further, it is 
quite easy to partial out the factor, and so to see whether or not it 
accounts for all the correlations between the separate subjects, 
whether, in other words, it is the only common element involved. 
When this is done, it is found that there are still a number of
correlations over and above those due to the general factor. In Burt’s
study, the residual correlations indicated considerable overlapping
within the following four groups of tests: (1) arithmetic problems 
and mechanical arithmetic; (2) handwork, drawing, writing speed 
and quality; (3) dictation, reading comprehension and speed; 
(4) composition, history, geography, and nature study. But there 
was little or no overlap, or even negative correlations, between the 
tests in one of these groups and those in another group. Such results 
show clearly the existence of group factors, i.e. of distinctive types of 
educational abilities, over and above general ability. In this instance 
there is an arithmetical factor, a manual one, a linguistic one, and 
one including all the higher, more integrative, school subjects. As 
Burt points out, we have here a basis for cross-classification of 
pupils. He suggests that they should be assigned to school classes in 
accordance with their achievements in the linguistic and ‘integrative’
subjects, and that they might advantageously be reclassified for
arithmetic lessons, and again for manual subjects. 

Analogous findings were obtained in the writer’s study of training
college marks. Three separate ‘families’ of subjects could be
distinguished, a practical one (teaching skill, speech training, and
physical training), a scientific one (psychology, hygiene, arithmetic),
and a literary one (English, speech training, education, history). Few
researches along these lines have been carried out in the industrial
sphere, owing to the difficulties of measuring and correlating
different vocational abilities. But we might expect to get from them
similar results, for example, group factors for the engineering type
of occupation, for the salesmanship type, the clerical type, and so on.


FACTORIAL ANALYSIS 

The statistical investigations of Spearman (1927) and others have 
shown that it is possible to account for practically the whole of a
set of test inter-correlations by postulating appropriate common
factors.¹ It is legitimate, then, to regard each test as made up of two
components, or as measuring two things. In so far as it correlates 
with other tests, it is measuring some factor or factors. This is often 
referred to as the test’s communality. But in so far as it fails to 
correlate (and most test inter-correlations, be it remembered, are 
only moderately high), it is measuring a purely specific component, 
or something which is peculiar to that test alone and has no
relationship to any other test. This is called the test’s specificity. It includes,
of course, the error which is present in the test on account of
imperfect reliability² (cf. p. 129). Thus any test of an educational or
vocational ability can be analysed into certain proportions of certain 
factors. For example, in Burt’s research, the arithmetic problems 
test can be analysed into so much general educational ability, so 
much arithmetical group factor (these together make up its com- 
munality), and, thirdly, a component specific to that particular test. 
The mechanical arithmetic test can be analysed into a rather smaller 
proportion of general factor, a large proportion of the arithmetic 
group factor, and a separate specific component. The presence of 
these specific components reduces the correlation between the two 
tests, but it is still quite high, namely + 0.76, because they have both
a general and a group factor in common. A test of another subject,
such as handwork, still correlates positively with arithmetic
problems, because they are both partially saturated with the general
factor, but the coefficient is a small one, namely + 0.38, since
handwork embodies a different (manual) group factor, as well as a
different specific component. 
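
The way in which shared factors produce a correlation can be illustrated with a small sketch. It assumes uncorrelated factors and standardized tests, so that the correlation between two tests is the sum of the products of their loadings on the factors they share. The loadings below are hypothetical: they are chosen only so as to reproduce the two coefficients quoted above (+ 0.76 and + 0.38) and are not Burt's actual figures.

```python
# HYPOTHETICAL loadings on uncorrelated factors (not taken from Burt's study).
loadings = {
    "arithmetic problems":   {"general": 0.76, "arithmetic": 0.50},
    "mechanical arithmetic": {"general": 0.60, "arithmetic": 0.608},
    "handwork":              {"general": 0.50, "manual": 0.60},
}

def implied_correlation(test_a: str, test_b: str) -> float:
    """Sum of products of loadings on the factors that both tests share."""
    shared = set(loadings[test_a]) & set(loadings[test_b])
    return sum(loadings[test_a][f] * loadings[test_b][f] for f in shared)

def communality(test: str) -> float:
    """Proportion of a test's variance due to common (general and group) factors."""
    return sum(value ** 2 for value in loadings[test].values())

def specificity(test: str) -> float:
    """Remaining proportion: the specific component, including error."""
    return 1.0 - communality(test)

r_arith = implied_correlation("arithmetic problems", "mechanical arithmetic")
r_hand = implied_correlation("arithmetic problems", "handwork")
print(f"problems with mechanical arithmetic: {r_arith:.2f}")
print(f"problems with handwork: {r_hand:.2f}")
for test in loadings:
    print(f"{test}: communality {communality(test):.2f}, specificity {specificity(test):.2f}")
```

The first pair shares both the general and the arithmetical factor; the second shares the general factor only, which is why its coefficient is so much lower.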

1 Note that we do not claim to be able to account for the whole of the
correlations, for such correlations are, of course, always liable to errors of sampling.
Often the residual coefficients, which are left after a general factor and some
group factors have been extracted, are so small relative to their P.E.s, that it is
not worth while analysing them further. Even with the most accurate techniques,
such as Hotelling's and Burt's, the later factors are generally so small that little
significance can be attached to them.

2 Thomson (1939) points out, however, that some of the influences producing
unreliability, e.g. those of the function fluctuation type, may actually be
common to several tests, and so contribute to their communality.

By means of factor analysis, then, it is possible to reduce a large
number of test results to a few underlying common factors. Any
psychological test, or measure of a scholastic or occupational
ability, can be regarded as made up of some of these factors, or as
possessing a certain ‘factor pattern.’ Further, an individual person’s
standing on each factor can be derived from his test scores, and he,
too, possesses a certain factor pattern. Thus, a pupil who is high in 
general educational factor will be above average in most of his school 
work, but his relative goodness at different subjects will depend on 
the strength or weakness of his group and specific factors. 

Dangers of Factor Analysis.—At this point a strong warning should
be given against carrying the factorial conception of abilities too far. 
Factors are not entities in the mind whose nature or constitution, 
and whose strength or weakness, are immutably fixed. Some writers 
do seem to assume that factorial analysis is revealing the funda- 
mental elements of which human minds are compounded, much as 
chemical analysis reveals the elements from which chemical sub- 
stances are compounded. We should realize that factors consist 
primarily of categories for classifying mental tests and examinations. 
Their value lies in the fact that they tell us objectively what correlates 
or overlaps with what, and so they take the place of unverified 
faculties and other subjective conceptions of human traits and 
abilities. 

It is obviously illegitimate to compare factors in the educational 
field with chemical elements. For these factors, which are deduced 
from the correlations between various school subjects, must depend 
largely on how the subjects are taught and on the stage which the 
pupils have reached. Different factor patterns will, therefore, be 
found in different school classes, according as the teachers connect 
up, or bring about overlap in their pupils’ minds between, the
subjects. For example, Oldham (1937-8) found very diverse
relationships between arithmetic, algebra, and geometry among different
12- to 13-year-old school classes, which could be ascribed to differ- 
ences in teaching methods and in the pupils’ attitudes towards the 
subjects. Others have proved that the factors extracted from a set of 
mental tests alter progressively with practice at the tests. 

Like all other statistical constructs, factors will be liable to a good 
deal of variation if they are derived from measures of small numbers 
of persons. But apart from such sampling errors, they must be 
regarded as depending to a considerable extent on the particular
group of persons measured. Thomson (1939) shows that they are
markedly affected by the heterogeneity or homogeneity of this 
group of persons. In addition, they are dependent upon the particular 
set of tests used, since the factor pattern of a test is in essence an 
expression of the relations between this test and all the other tests in 
the set. Statistical analysis of a single test in isolation is incapable of 
telling us what factors it contains (though subjective analysis, or 
inspection of its content, may often give useful hints as to its
probable factor pattern, which can later be verified by studying
objectively its correlations with other tests). However, this limitation
may not be very serious, since, as we shall see below, at least some of 
the most important factors emerge in almost identical form from a 
wide variety of sets of tests. 

Then there is the more subtle difficulty that the same set of tests 
can always be factorized in a great many different ways. Most of 
these ways will lead to quite illogical results, so that the investigator 
has to choose from among the various possible factorizations the 
most meaningful and convenient pattern. Take, for example, the 
educational tests applied by Burt. We have so far regarded these as 
analysable into general educational ability, four group factors for 
related subjects, and separate specific factors for each test. Now Burt 
showed, incidentally, that the general factor was not identical with 
general intelligence as estimated by teachers, or as measured by 
intelligence tests, though the two overlapped closely. But it would 
have been quite legitimate to analyse measures of intelligence along 
with the educational measures, and to partial out intelligence first, 
so making it the principal factor underlying achievement. If this 
had been done, it is likely that a subsidiary general factor would 
have emerged representing the pupils’ application to, and interest 
in, their work in general. Intelligence plus this subsidiary factor 
would then make up what we have previously denoted as the 
general educational ability factor. The group factors would also need 
to be somewhat modified, since most of the fourth one (composition, 
history, etc.) might have been absorbed into, or accounted for by, 


the intelligence factor.¹

In investigating students’ college [...] analogous results. The
inter-correlations [...] either by a very prominent general factor [...]
restricted to a few subjects, or else by three factors, all of about equal
prominence, each entering into several subjects.

1 In his later study (1939a), Burt describes several methods of factorizing
which do yield numerically different, though logically consistent, results.


Certain other precautions which should be observed in inter- 
preting the results of factorial analyses will be mentioned later. 
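
The statement that one set of tests can be factorized in many ways can be illustrated numerically. The sketch below takes four hypothetical tests with made-up loadings on two uncorrelated factors, rotates the factor axes through an arbitrary angle, and checks that both sets of loadings reproduce the same inter-correlations. The loadings, the angle and the function names are assumptions for illustration only.

```python
import math

# HYPOTHETICAL loadings of four tests on two uncorrelated factors.
LOADINGS = [(0.8, 0.3), (0.7, 0.4), (0.6, -0.2), (0.5, -0.5)]

def rotate(loadings, angle):
    """Rotate the two factor axes through the given angle (in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(a * c - b * s, a * s + b * c) for a, b in loadings]

def reproduced_correlations(loadings):
    """Correlation implied for each pair of tests: sum of products of loadings."""
    n = len(loadings)
    return {
        (i, j): sum(x * y for x, y in zip(loadings[i], loadings[j]))
        for i in range(n)
        for j in range(i + 1, n)
    }

rotated = rotate(LOADINGS, math.radians(30))
original_r = reproduced_correlations(LOADINGS)
rotated_r = reproduced_correlations(rotated)
largest_gap = max(abs(original_r[pair] - rotated_r[pair]) for pair in original_r)

print("rotated loadings:", [(round(a, 3), round(b, 3)) for a, b in rotated])
print(f"largest difference in reproduced correlations: {largest_gap:.1e}")
```

The loadings change appreciably while the reproduced correlations do not, so the correlations alone cannot decide between the two descriptions; the choice rests on which pattern is the more meaningful.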

Types of Factor Pattern.—Three main types of factor pattern 
recur very frequently in educational and psychological investigations. 
They may be represented by the following diagrams, each of which
refers to twelve hypothetical tests, Nos. 1, 2, ... 12.

[Diagrams not reproduced. Recoverable titles: I. Bi-factor Pattern; II. Multiple-factor Pattern; and a third diagram with columns headed General Factor and Specific Factors.]

In the first, which Holzinger and Hartman (1941) call the bi- 
factor pattern, there is a general factor running through all the tests, 
four different group factors, and specifics. This pattern (or similar 
more complex ones) is usually adequate for representing the results 
of tests of abilities. For instance it applies excellently to Burt’s 
investigation. 

The second, multiple-factor, pattern requires three, four, or more 
factors, none of them general, but all running through several tests, 
to account for the test inter-correlations. It will be seen that Test 1 
involves factors A and C and a specific. But as far as possible each 
test is covered by a single factor (with only negligible loadings on the 
others). This type of pattern is referred to as Simple Structure (cf. 
Thomson, 1939; Thurstone, 1947). It is often used in analysing tests 
of abilities, but is still more appropriate where tests of personality, 
character, or temperament are concerned, since in these fields a 
general trait, running through all the tests, would seldom be appro- 
priate. For example, Thurstone (1931) obtained measures of the 
interests of a group of students in a large number of different occu- 
pations—advertising, art, chemistry, etc.—and found that the interest 
scores could be classified under four main factors. These he roughly 
identified as interest in language, in science, in business, and in 
people. Each separate interest could be made up of appropriate 
proportions of these four factors. Other instances have been sur- 
veyed by Eysenck (1953). 

Spearman’s Two-factor Theory.—The third, unitary-factor, pattern
has aroused a great deal of controversy. Historically it was the first
type to be discovered. Early in the present century, Spearman found 
that diverse tests of mental abilities usually gave inter-correlations 
which could be wholly accounted for (within the limits of their 
errors of sampling) by a single general factor plus specific factors. 
He enunciated certain criteria to which the correlations must con- 
form if this is to be possible, the most important of which is called 
the tetrad difference criterion.: The general factor obviously cor- 
responds to what we commonly mean by general intelligence, but 
Spearman preferred to symbolize it by the letter g, the specific factors 
by letter s,, Sa, · -  €tC., 50 as to get over the difficulty, mentioned 
above, that the term "intelligence" has been so diversely defined by 
different psychologists. His g factor, it was claimed, would be the 
same whatever the tests used. It could be measured, for example, by 
à battery of tests consisting of Analogies, Completion, Vocabulary, 
and Directions, or equally well by a battery consisting of practical per- 
formance tests (pp. 160-2). Each test would have its own independent 
5, but in so far as it overlapped with other tests, it would be measur- 
ing g. This view is widely known as the Two-factor Theory. 
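
The tetrad difference criterion can be illustrated numerically. The sketch below is not Spearman's own computing procedure; it assumes the usual statement of the criterion, namely that for any four tests a, b, c, d the differences such as r(a,b)·r(c,d) − r(a,c)·r(b,d) vanish when one general factor alone accounts for the correlations. The g-saturations are hypothetical figures, and the function names are introduced only for this example.

```python
from itertools import combinations

# Hypothetical g-saturations for four tests (illustrative figures, not from the text).
g_loadings = {
    "Analogies": 0.8,
    "Completion": 0.7,
    "Vocabulary": 0.6,
    "Directions": 0.5,
}


def two_factor_correlation(a_i, a_j):
    """Correlation of two tests sharing only g: the product of their saturations."""
    return a_i * a_j


def tetrad_differences(r, tests):
    """Return the three tetrad differences for every set of four tests.

    r maps a frozenset of two test names to their correlation.
    """
    def corr(x, y):
        return r[frozenset((x, y))]

    out = []
    for a, b, c, d in combinations(tests, 4):
        out.append(corr(a, b) * corr(c, d) - corr(a, c) * corr(b, d))
        out.append(corr(a, b) * corr(c, d) - corr(a, d) * corr(b, c))
        out.append(corr(a, c) * corr(b, d) - corr(a, d) * corr(b, c))
    return out


tests = list(g_loadings)

# Correlations generated by g alone: every tetrad difference should vanish.
r = {
    frozenset((x, y)): two_factor_correlation(g_loadings[x], g_loadings[y])
    for x, y in combinations(tests, 2)
}
tetrads = tetrad_differences(r, tests)

# Two tests that overlap beyond g (a hypothetical group factor) break the criterion.
r_extra = dict(r)
r_extra[frozenset(("Analogies", "Completion"))] += 0.2
tetrads_extra = tetrad_differences(r_extra, tests)

print(tetrads)
print(tetrads_extra)
```

In practice observed correlations carry sampling error, so the differences would be compared with their probable errors rather than with exact zero.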

For a while it was thought that the Two-factor Theory would 
apply to all tests of abilities, not only to intellectual ones, and that 
very few additional group factors would be needed. Only occasion- 
ally would a few tests which were very similar in content (e.g. tests 
of musical aptitude) be found to inter-correlate over and above their 
correlation due to g. 

Different tests may differ considerably in their g-saturations. 
Those with the highest saturation seem to involve the capacity for 
seeing relationships, whilst tests which involve mainly mechanical or 
rote memory contain very little g. Among school subjects, classics is
most dependent on g; French, English, history, and the like are also 
highly saturated. Manual subjects and music have much more 
prominent s's and relatively little g. Thus we find high correlations 
between classics and English, low correlations between handwork 
and music, and moderate correlations between classics and hand-
work, or English and music. The Two-factor Theory not only
explains these and similar facts as to the relations between abilities, 


* For an explanation, see Knight (1933), or Spearman’s own book (1927). 
The apparent discrepancy between the names—Two-factor Theory and Unitary- 
factor Pattern—is merely due to the fact that Spearman includes the specific 
components as factors, whereas the terms we have used refer only to the com-
munality of the factorized tests.

but also provides the mental tester with a scientific basis for choosing
tests which will give the best measures of g. Although every sub-test 
in a battery of intelligence tests will involve a certain s-factor, tests 
can be selected in which these s’s are small, and when the tests are 
combined the s’s will tend to cancel out, so leaving an almost pure 
measure of g. 
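
The claim that the s's tend to cancel out when tests are combined can be checked with a little arithmetic. The following is a minimal sketch under stated assumptions: each standardized test is taken to be a·g plus an independent specific part, all components are uncorrelated with unit variance, and the sub-tests are simply added with equal weight. The saturation of 0.7 and the battery sizes are hypothetical.

```python
import math


def composite_g_correlation(loadings):
    """Correlation between g and the unweighted sum of standardized tests.

    Assumes test i = a_i * g + sqrt(1 - a_i**2) * s_i, with g and all the
    s_i mutually uncorrelated and of unit variance.
    """
    n = len(loadings)
    covariance_with_g = sum(loadings)
    cross_terms = sum(
        loadings[i] * loadings[j] for i in range(n) for j in range(i + 1, n)
    )
    # Each test has unit variance; pairs of tests covary by a_i * a_j.
    variance_of_sum = n + 2 * cross_terms
    return covariance_with_g / math.sqrt(variance_of_sum)


# Hypothetical batteries of 1, 2, 4 and 8 sub-tests, each with g-saturation 0.7.
battery_sizes = [1, 2, 4, 8]
results = {k: composite_g_correlation([0.7] * k) for k in battery_sizes}
for k, value in results.items():
    print(k, round(value, 3))
```

On these assumptions the correlation of the total score with g rises steadily as sub-tests are added, though it never quite reaches unity.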

Criticisms of the Two-factor Theory.—The theory has been sub- 
jected to much criticism, both on psychological and on statistical 
grounds. Spearman suggested that g might correspond to the con- 
ception of ‘general mental energy.’ Whether or not this is a psycho- 
logically sound explanation need not concern us here. On the 
statistical side, Thomson (1939) and others pointed out that, al- 
though the analysis into general and specific factors is legitimate 
when the tetrad difference criterion is satisfied, yet this is not the 
only possible mode of analysis; that if mental abilities consisted of 
large numbers of overlapping group factors, the same criterion 
would still be satisfied. Thomson’s view would seem to the present 
writer to be preferable from the psychological standpoint, but 
Spearman’s view is perhaps more practically convenient, since we do 
already in everyday life assume the existence of something like g— 
whether we call it ‘brains,’ ‘brightness,’ ‘intelligence,’ or what not— 
and the Two-factor Theory provides a scientific basis for this com- 
mon-sense assumption. 

The theory, and the experiments which it has stimulated, at one 
time seemed to discredit all the mental faculties which had so often 
been abused in traditional psychology and in common parlance. But 
more recent research with modern factorial methods indicates that 
the theory requires considerable modifications in the matter of 
group factors. A large variety of group, or multiple, factors has been 
described, particularly in American researches, and some of them do 
resemble the older faculties. Unfortunately, there are considerable 
disagreements between different factorists as to which are the most 
distinctive and fundamental, but the series that Thurstone established 
is very widely accepted. Thurstone (1938) applied an extensive bat- 
tery of psychological tests to university students and found that they 
resolved into factors which he identified as verbal ability (V), 
number ability (N), word fluency (W), spatial judgment (S), rote
memory (M), perceptual speed (P), and deductive and inductive
reasoning (D, I). A fairly similar picture was obtained in studies of
children, even down to the 5- to 6-year level, and Thurstone has issued
sets of tests for measuring these factors at several age levels, namely
the P.M.A. or Primary Mental Abilities tests. Note that he does not 
allow for a g running through all these abilities, though Spearman 
and his followers have shown that a pattern of general + group 
factors fits the same results as well as, or better than, a pattern of 
independent multiple factors. 

It may safely be concluded that no test measures nothing but g and 
a specific factor, since the type of test material employed always 
introduces some additional common element. All verbal intelligence 
tests, together with other tests depending on manipulations of words, 
involve a verbal and educational factor, which has been named V or
v : ed. Those based on multiple-choice items, to be done at speed,
embody a further factor or factors, not present in unspeeded,
creative-response tests such as Terman-Merrill. In order to escape
the influence of v : ed, many mental tests have been constructed from
abstract diagrams or pictures. Such tests, however, in so far as they
involve spatial imagination or the mental manipulation of shapes,
yield a spatial factor called k (El Koussy, 1935), or S (Thurstone, 1938).
Certainly some, and possibly all, of the common non-verbal per- 
formance tests bring in this same k, and various minor dexterity 
factors in addition. Mechanical ability tests too measure k, together 
with a mechanical information or experience factor, m. School 
examinations and educational attainments tests depend (as Burt’s
research showed) not only on g and v : ed, but also on a factor which
represents the pupils’ interest in and attitudes to work, and their 
character or temperament qualities. Following Alexander (1935), 
this is often referred to as the X-factor. In addition, there are probably 
group factors for families of school subjects such as the linguistic, 
scientific and mathematical, practical and manual, and further sub-
factors largely governed, as we have seen, by teaching methods.

Educational and Vocational Applications.—The findings of fac- 
torial research have important educational and vocational bear- 
ings. If the Two-factor Theory was exclusively accepted, educa- 
tional and vocational guidance by means of tests would scarcely be 
possible. If we wished to advise a child as to the type of education 


* A more detailed discussion of the pros and cons of the g + group-factor and 
the multiple-factor approaches is given elsewhere (Vernon, 1950), and it is shown 
that they can fairly readily be reconciled. 

* Alexander (1935) claimed the existence of a distinctive practical factor, F,
in certain performance tests; but later work appears to indicate that it cannot be
differentiated from k.

or the occupation for which he would be most suited, we could
apply tests of g and so deduce the general intellectual level of the 
work he should take up. But we could not decide whether he would 
be likely to do better at classics or science, at a clerical job or an 
engineering trade, and so on, except by giving a vast number of 
tests of different s factors, or by letting him try each type of work in 
turn so as to find by practical experience which specific one suited 
him. There could be no recommending of general lines of occupation. 
To a certain extent this does seem to be true. Educational and 
vocational guidance among adolescents are extremely difficult and 
uncertain. They have to be based more upon a subjective analysis of 
interests and character traits, background and relevant training, than 
upon objective predictions by tests. Yet common experience strongly 
suggests that there do exist general categories or types of work, and 
that therefore a number of group factors, additional to g, could be
established by further research. If this were done, guidance would 
become much more scientific. Full consideration of a candidate's 
interests and character would, of course, still be required, but his 
scores on a series of tests for each of these factors would give 
objective indications of his probable ability along each of the main 
lines of work.1 Much the same is true in clinical work with malad-
justed children or neurotic or psychotic adults. Psychiatrists and
clinical psychologists often claim, on the basis of poorly validated
tests, that their patients show deterioration or weakness in such
faculties as memory, orientation, attention, visualization, etc. Factor
analysis indicates that most of these tests are merely rather un-
reliable measures of g, or g + v : ed. But at the same time it can be


1 Thomson (1939) has pointed out that when tests are given with the object
of predicting educational or vocational achievement, the extraction of factors
from the tests may constitute quite an unnecessary intermediate step. A much
more efficient method is to correlate the tests directly with measures of achieve-
ment and then to use multiple correlation for obtaining the most accurate pre-
dictions. While this view is, of course, perfectly true from the statistical stand-
point, it would seem to the present writer to be practically applicable only in
educational and vocational selection, not in guidance. It should certainly be
used by the tester who wishes to select the best candidates for some single and
definite educational course or job. But the problems of the vocational adviser
are much too complex to be tackled as yet by this exact method. He is forced to
predict a candidate's probable achievement in generalized terms, or factors. At
the present time he is doing so largely in terms of unverified factors, derived
from subjective considerations of tests and lines of work. The view put forward
above is that his procedure could be made more systematic and scientific if more
factors were experimentally established.


used to isolate definite group factors such as mental speed, and
flexibility in the formation of concepts, which are especially apt to 
be affected by abnormal mental states; and it can show which tests 
are the most promising for measuring such factors. 

While we have a fairly clear conception of the main factors under- 
lying the abilities of school-children and of unselected adults, the 
picture among older adolescents and adults of superior ability is
much more complex and obscure. A g-factor can still be extracted 
from their mental test results, but it seems to play a decreasingly 
important part in educational or vocational achievement, and inter- 
ests, work attitudes, and temperament traits become more influential. 
An important series of researches by Guilford et al. (1950-4) in America
with high-grade groups such as air-force pilots and university
students suggests that numerous intellectual factors can be dis-
tinguished in the fields of reasoning, planning, judgment, originality,
etc.; and that it would be more profitable to develop tests for these
than for a vaguely conceived general capacity of intelligence. Which
of them will emerge as consistent and clear-cut, and as educationally
or vocationally important, among other groups (e.g. British gram-
mar school pupils and students) is still a matter of doubt.

Conclusions about Factor Analysis.—A great deal of unavoidably 
complex and expensive research will be needed before we can hope 
to arrive at a reasonably complete map, as it were, of the whole 
realm of human abilities. The difficulties are enhanced because, as
we have seen, the factor patterns which an investigator extracts are 
often dependent on the particular tests and particular testees he 
employs, and because there is much room for subjective judgment in 
the type of factor pattern he chooses. Research at present is rather 
badly co-ordinated, since each investigator sets out to explore fresh 
territory instead of trying to build on the factors already fairly well 
established by previous workers. Further, there is always the danger 
of mathematical rather than psychological considerations getting 
the upper hand. As already mentioned, factorial investigations are 
not revealing the elements out of which the mind is compounded, 
but are classifying test performances with a view to predicting more 
effectively other performances, e.g. in the educational or vocational 
sphere. But by no means all of the important facets of human
nature are susceptible yet of accurate measurement. There may be
abilities, such as executive capacity, intuitive power, goodness of
teaching, and social intelligence, which are generally recognized in
daily life, but which are not yet properly substantiated or analysed
owing to the inadequacies of our tests. Still more is this true in the 
realm of emotional traits and interests, for reasons which the writer 
has described elsewhere (Vernon, 1953). Yet we cannot hope to 
make successful predictions about an individual child or adult 
unless we know every side of his make-up. It may well be that all 
these attempts at analysing the measurable abilities and traits of 
human beings will never give us a picture of an individual personality 
considered as a whole, that we shall always need to supplement test 
results with subjective judgment and intuition in order to resynthesize 
them into a living entity. But this, too, is a matter of psychological 
controversy which the reader may follow up elsewhere. 

We have no space to discuss the actual statistical techniques of 
factor analysis. Spearman has fully described his method in The 
Abilities of Man (1927). It is not difficult to apply, though rather 
laborious. It is the most accurate method of extracting unitary- 
factor patterns, but it is hardly suitable for bi- and multiple-factor 
patterns. Holzinger's (1941) method is an extension which deals 
readily with bi-factor patterns, though it does not seem to possess 
much advantage over the flexible group-factor techniques outlined 
by Burt (1950). The most widely used technique is Thurstone's 
centroid method, which is well-described by Guilford (1936), and 
is not difficult to acquire. But it has many more elaborate refine- 
ments, for which Thurstone's book (1947), or Cattell (1952), should 
be consulted. Other more accurate techniques, whose aims and 
guiding principles are somewhat different from those described in 
the present chapter, have been developed by Hotelling, Kelley, and 
Burt. The clearest account of the whole subject is that of Thomson
(1939).


CHAPTER IX 
MENTAL TESTS 


MENTAL testing is simply a more refined and scientific method of 
doing something which we all of us do every day of our lives, that 
is assessing one another’s abilities and character traits. Our ordinary 
method is to observe a person's behaviour, including his facial 
expressions and gestures, in various situations and circumstances, 
and his speech or written productions. For example, we might watch 
a small boy playing with bricks, and from his skilfulness, or the 
elaborateness of the resulting construction, jump to the conclusion 
that he is quite a bright child. Obviously the basis for this and other 
such judgments is very haphazard and unreliable. There is no sure 
evidence that the boy is acting more, or less, intelligently than the 
majority of children. Nevertheless this performance may quite well 
be refined and turned into a scientific mental test. So-called per-
formance tests of intelligence are based on just such activities with
bricks, blocks, and other concrete material. 

Let us then first consider the essential features of a mental test 
which distinguish it from such everyday judgments and from the
slightly more scientific school examination. 

1. Standardization of Test Situation.—The first step which the 
tester takes is to standardize the test situation. He will, for example, 
adopt some standard set of bricks and give definite instructions, so
that all children who are subjected to the test shall do it under as 
nearly as possible identical circumstances. Until this is done it is not 
legitimate to compare the performances of different children. With 
every good test there is published a detailed set of instructions, to 
which the tester must closely adhere. School and university ex- 
aminers, on the other hand, usually give no instructions beyond 
stating the number of questions to be answered, and they neglect to 
inform the examinees what type of answers they require. Because of 
this, and because of frequent ambiguities in the wording of questions,
different examinees may conceive the tasks that are set them in a
variety of different ways, and it becomes exceedingly difficult to
compare and mark their answers.

The material of a test, its conditions of application, and its in-
structions are further planned to be readily comprehensible to, and
suitable to the mental level of, the children or adults for whom it is 
intended. This is another feature which some school examiners 
might be advised to copy instead of setting subjects for composi- 
tions, or other questions, which are far above the heads of the 
pupils. 

We must, however, admit that several widely used tests do not 
entirely live up to these criteria. Some of them were originally con- 
structed and standardized in America; hence they require consider- 
able modification or translation before they are suitable for British 
children. We are forced to employ these ‘second-hand’ tests because 
the production of fresh tests is an extremely lengthy and expensive 
business. Unfortunately different British testers do the necessary
‘re-modelling’ in different ways, so that the test situation is no longer
identical for all [...].

2. Standardization of [...].—The mental tester does not ob-
serve the child's [...] or his finished product, in the casual
manner implied [...], but obtains an accurate record
of them. He insists on knowing definitely what was the test response,
i.e. the child's reaction to the previously laid down test situation.
Either the response is something which can be delimited in quantita- 
tive terms (e.g. so many bricks put into their correct positions in so 
many seconds), or else it is something which can be designated un-
equivocally as right or wrong.* It is most important that the personal
opinion of the tester should play no part in this record. Any two 
testers observing and scoring the response should arrive at identical 
records. If this condition holds the test is called an objective one. All 
good mental tests, then, are provided with scoring keys or manuals 
of instructions which enable the recording and scoring to be done 
objectively. 

Our everyday judgments of traits or abilities are, by contrast, 
decidedly subjective, for numerous investigations have demon- 
strated their liability to vary with the person who makes them. Some 
teachers are still convinced that they can assess the intelligence of 
their pupils better than can any test. But the fact is that different
teachers, assessing the same pupils, give such different opinions that
the scientific psychologist can put little trust in any of them. The

* So-called Quality Scales are tests whose scoring constitutes an exception to
this rule, cf. p. 154.

same is true of many school and other examination marks, as will
appear in Chap. XI. 

3. Norms.—The layman commonly seems to regard the ob- 
served behaviour as having some absolute value or significance. 
There is no doubt in his mind that it indicates high intelligence, or 
poor sociability, or whatever the trait may be. The more cautious 
psychologist realizes that such judgments must be based largely on 
more or less vague recollections of the behaviour of other similar 
children under similar circumstances, and so insists that all grading 
of abilities is essentially a comparative or relative matter (cf. Chaps. 
II and IV). Thus one of the most vital accompaniments of a mental 
test is the norms, by reference to which any child's goodness or 
poorness of performance may be assessed. Unfortunately it is also 
the feature in which many published tests are defective, on account 
of the difficulties already described (pp. [...]).

In the nature of examining it is hard to establish norms
for examinations, since they cannot [...] numbers of
pupils or students beforehand [...] standardized
test. We have seen, however, that [...]onalizing marks,
which is essentially equivalent to the setting up of provisional
norms, is necessary if the present ambiguities in the significance of 
school or examinations marks are to be avoided. 
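
The comparative character of norms may be made concrete as follows. This is only a sketch: the norm-group scores are invented, and the two conversions shown (a percentile rank and a standard score) are common ways of expressing relative standing, not necessarily those adopted by any particular published test.

```python
from statistics import mean, pstdev

# Hypothetical raw scores of a norm group of children of the same age (not real data).
norm_scores = [12, 15, 15, 18, 20, 21, 21, 23, 25, 30]


def percentile_rank(raw, norms):
    """Percentage of the norm group scoring below raw, counting ties as half."""
    below = sum(1 for s in norms if s < raw)
    equal = sum(1 for s in norms if s == raw)
    return 100.0 * (below + 0.5 * equal) / len(norms)


def standard_score(raw, norms):
    """Deviation of raw from the norm-group mean, in standard-deviation units."""
    return (raw - mean(norms)) / pstdev(norms)


child_raw = 23  # hypothetical raw score of the child being assessed
print(percentile_rank(child_raw, norm_scores))
print(round(standard_score(child_raw, norm_scores), 2))
```

The same raw score of 23 would of course receive a different standing against a different norm group, which is the point made in the text.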

4. Reliability and Validity.—A single piece of behaviour,
casually observed, as in our illustration, would probably be very low 
in reliability. A discussion of the statistical methods of determining 
reliability was given above (p. 126), and the reliability of several of 
the more important tests is mentioned later in this chapter. 

The most serious defect in the layman’s everyday judgments is that 
he seldom has any good proof that the observed behaviour validly 
indicates the ability which he deduces from it. It may have arisen 
from some quite different trait or ability, or else from an ability 
which is almost wholly specific (in Spearman’s sense), i.e. not corre- 
lated with anything else. To be reliable a test need only measure 
accurately the ability to perform that test, but to be valid it must also 
correlate effectively with other measures of whatever it is supposed 
to test. The mere inspection of the content of a test is seldom sufficient
for the determination of its validity, although it is a necessary
preliminary.
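
The distinction between reliability and validity can be shown with two correlation coefficients. The sketch below assumes, for illustration, that reliability is estimated by giving the same test twice (only one of the possible methods) and that validity is estimated by correlating the test with some outside criterion; all the figures are invented.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores of six testees (not real data).
first_testing = [10, 12, 15, 18, 20, 22]
second_testing = [11, 12, 14, 19, 19, 23]   # same test applied again
criterion = [55, 40, 62, 48, 70, 58]        # some independent measure of the ability

reliability = pearson_r(first_testing, second_testing)
validity = pearson_r(first_testing, criterion)

print(round(reliability, 2), round(validity, 2))
```

With these figures the test agrees closely with itself yet only moderately with the criterion, which is the situation described above: a test may be reliable without being valid.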


Such inspection of educational tests and examinations may reveal 
their failure to cover the ground which the pupils should have
accomplished. Thus school and university examinations are often de- 
fective because their questions are too few in number, or are badly set. 
It will be obvious also to a tester that the content of some of the pub- 
lished educational tests is unsatisfactory, either because they are 
designed for pupils whose tuition differs widely from that current 
among his testees, or because they are out-of-date. But apart from 
such flaws, many tests involve group or specific factors which detract 
from their effectiveness as measures of some general ability. For 
example, a test of reading may be based on the capacity for pro- 
nouncing difficult words. This may or may not indicate children’s 
capacities in other aspects of reading, such as their fluency, or their 
comprehension of what they have read. Investigation by means of 
correlations is essential for establishing these points. Again, the 
technique of the test, i.e. the way in which the testee is required to 
express his ability, always seems to introduce an irrelevant factor. 
In Chap. XI we shall show that in an ordinary essay-type of exam- 
ination, the translation of the examinees' knowledge into essay-form 
is one such factor, and in Chap. XII the examiner's translation of his 
questions into new-type or objective form will be recognized as an- 
other. Temperamental factors, health, fatigue, and the like also 
influence a testee's performances at many tests, and so reduce their 
validity as measures of ability. 

Most educational and vocational tests are assumed not only to 
be valid indicators of some present general ability, but also to be 
prognostic of future achievement. Their predictions should then, of 
course, be followed up and compared with subsequently obtained 
measures of achievement. In the educational field there have been 
many such investigations whose results usually show our tests and 
examinations to possess disappointingly poor validity (cf. Chap. XI). 
Vocational psychologists have also tried to correlate their predictive 
tests with some measure of output, or with estimates of efficiency 
obtained from supervisors or foremen. As shown elsewhere (Vernon 
and Parry, 1949), however, it is not always easy to secure a satisfac- 
tory criterion of success or failure in an occupation against which 
the test results can be checked. 

The validation of intelligence tests, or of tests of capacities such as 
practical ability, verbal ability, and the like, is particularly difficult, 
since we possess no objective criterion of intelligence, etc., with
which to compare them. Ratings have sometimes been used, that is,
teachers' estimates of the intelligence of children in their charge, but
these are highly fallible. If psychologists are unable to agree as to 
what they mean by intelligence, it is unlikely that laymen's views will 
be any more concordant. Indeed, if teachers could judge it accurately, 
there would be no need for tests. The main defect of such ratings is 
that they are strongly biased (often quite unwittingly) by considera- 
tions such as the children's industriousness and good school be- 
haviour. Parents' judgments are still less trustworthy. Several group 
intelligence tests have been validated by comparing their results with 
the results of one of the Binet scales. But the validity of these scales 
is open to doubt. The items which they contain have always been 
chosen so as to differentiate older (hence presumably more intelligent) 
children from younger (less intelligent) ones. This criterion is, how- 
ever, dubious in value, since it would admit tests of, say, height and 
weight which have very little validity as indicators of intelligence. Yet 
we do not rely merely on subjective judgment when we state that these 
must be excluded from a test of intelligence, for in the absence of 
any external criterion of validity, we make use of what has been called 
the method of internal consistency. 
This method consists either of factorial analysis, or of a modifica- 
tion thereof. All the test items or sub-tests are selected in the first 
place on the basis of subjective opinion; they must appear to the 
tester to involve the exercise of intellect. Thereafter they are studied
empirically. Either each item is compared with every other, or with 
the sum of all the rest of the items, or sets of items such as the sub- 
tests in a group intelligence test are compared with other sets. The 
inter-correlations will serve to show whether or not all the items are 
consistent with one another—whether they are measuring one and 
the same thing. And items that fail to conform to this criterion are 
excluded from the final, published, version of the test. The statistical 
conditions imposed by Spearman in his factorial analysis of group 
intelligence tests are much more rigorous than those used by Terman 
in revising the Binet scale. Both, however, prove that there is a gen- 
eral factor running through all their test items or sub-tests. Spearman 
and his followers wish to measure nothing but g, that is, they aim at 
a uni-factor pattern, and so exclude a number of sub-tests which fail 
to conform to such a pattern. The Binet test, on the other hand, al- 
though it certainly embodies a g-factor, contains a hotch-potch of 
heterogeneous tasks which would, if fully analysed, yield several 
group factors (a bi-factor pattern), or even a set of multiple factors 
(cf. Burt, 1939b). The effects of these different techniques of test
construction will be discussed more fully below. What matters for
our present purpose is that when a test, or a set of tests, is internally 
consistent, this signifies that it will validly predict any other be- 
haviour which involves the same general factor. Intelligence tests 
work fairly effectively because scholastic success, and many of the 
other intellectual activities of everyday life, are known to be depen- 
dent largely on the g which they measure. 
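
One form of the method of internal consistency, the comparison of each item with the sum of all the rest, can be sketched as below. The right/wrong records are invented, the cut-off of 0.2 is an arbitrary illustrative value, and the function names are our own; actual test construction involves further safeguards not shown here.

```python
import math

# Hypothetical right (1) / wrong (0) records: one row per testee, one column per item.
scores = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1],
]
item_names = ["item_1", "item_2", "item_3", "item_4", "item_5"]


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def item_rest_correlations(rows, names):
    """Correlate each item with the total of all the other items."""
    result = {}
    for k, name in enumerate(names):
        item = [row[k] for row in rows]
        rest = [sum(row) - row[k] for row in rows]
        result[name] = pearson_r(item, rest)
    return result


THRESHOLD = 0.2  # hypothetical cut-off for keeping an item

item_rest = item_rest_correlations(scores, item_names)
retained = [name for name in item_names if item_rest[name] >= THRESHOLD]
excluded = [name for name in item_names if item_rest[name] < THRESHOLD]

print(retained)
print(excluded)
```

Here the fifth item runs against the others and would be dropped from the final version, while the remaining four hang together as measures of one and the same thing.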

Test Construction.—Before proceeding, we would strongly urge 
amateur testers and teachers not to attempt to make up their own 
tests, apart from new-type ones for school examination purposes. To 
devise test items similar to those which appear in published tests is 
not difficult, but a great many technical features enter into the selec- 
tion of suitable items and into their standardization, in addition to 
those described here. The points which we have outlined are meant, 
not to show how to construct a test, but to enable amateurs to choose 
more wisely the tests which will be most suitable for their purposes, 
and to give them some assistance in understanding what they are
doing when they apply published tests.


TYPES OF TESTS

At the end of this chapter is given a classified list of most of the
mental tests available in this country. The main headings are:

I. Attainment, Achievement, or Educational Tests.
   A. Examinations.
   B. Standardized Educational Tests.
      1. Individual reading tests.
      2. Language development and Vocabulary.
      3. Individual oral arithmetic.
      4. Group tests.

II. Individual Intelligence Tests.
   A. Versions of the Binet-Simon scale.
   B. Wechsler scales.
   C. Miscellaneous scales.
   D-E. Performance Tests.

III. Group Intelligence Tests.
   A. Oral.
   B. Non-verbal.
   C. Verbal.

IV. Aptitude and Special Ability Tests.

V. Personality Tests and Assessments.


The last two sections merely mention some of the tests commonly 
applied in this country, but the first three aim to be comprehensive. 
Although we shall not attempt to describe particular tests, the follow- 
ing general explanations of, and comments on, the main types may be 
useful to those who are not already thoroughly familiar with the field. 

Examinations and Educational Tests.—Examinations are still the 
most important method of educational measurement, and so come 
at the head of our list. Their defects, and the possibilities of improv- 
ing them, or of employing substitutes, are discussed in Chaps. XI- 
XIV. Standardized educational tests must be contrasted with them, 
for although they are made up of the same kinds of items or questions 
as occur in objective or new-type examinations, their functions are
quite different. They are not designed to cover the work done by 
particular classes of pupils or students. Nor are they meant, pri- 
marily, for selecting pupils who are most likely to do well in some 
future work, though they are nowadays quite commonly used for 
assessing attainments at the 11+ selection stage. Rather they serve 
to show what is the standing of a class, or of an individual pupil, 
relative to other similar pupils, either on some general subject such 
as English or arithmetic, or on special branches of, or stages within, 
a subject (these latter being known as diagnostic,¹ or analytic, tests). 
The fact that they are published, and their expense, obviously make 
them unsuitable for use as school examinations. By means of them, 
however, a teacher can usefully grade his pupils at the beginning of 
the school year, so as to find which ones will need most coaching, or 
in which parts of a subject they are most advanced or backward. If 
the norms are adequate (cf. Chap. IV), they will tell him how his 
pupils compare with pupils in general of the same age. The psycholo- 
gist at a Child Guidance Clinic, or the vocational adviser, finds them 
especially valuable for assessing a child’s strong and weak points. 

Quality Scales.—Although most educational tests consist of large 
numbers of short items, both so as to cover the subject as thoroughly 
as possible, and so as to admit of objective scoring (cf. Chap. XII), 
some educational products which are too complex to be assessed in 
this piecemeal fashion can nevertheless be measured by ‘quality’ or 
‘product scales.’ Such standardized scales are available for grading 
handwriting, composition, drawing, etc. Each of these consists of a 
series of specimen products which have been carefully chosen as 
representative of certain levels of achievement. The testee is told to 
write or draw the same matter in his normal manner, and his product 
is, as it were, slid along the scale, or compared with the specimens, 
until one is found to which it corresponds in achievement. It is then 
given the score of this specimen. Such grading cannot be made 
wholly objective, but there is evidence that it improves with practice. 
Different testers will, when experienced, award the same or nearly 
the same score to a child's product. Certainly the process is fairer 
than grading without any external standards. Hence the adoption of a 
similar plan in the marking of school work was advocated in Chap. II. 

¹ Cf. the excellent discussion of diagnostic tests by Schonell (1950); also 
Whilde (1955). 

Burt's scales all consist of specimens selected from a large number 
as being typical of the work of average 5+, 6+, ... 14+ children. 
Schonell's composition scales, covering the 7- to 14-year range, 
are similar. Cattell’s handwriting scale for school-leavers and his 
drawing scale for adults are merely graded I to V, representing 
different degrees of merit at these age levels. 

Quantitative Tests and Scales.—The scoring of quantitative tests 
either by the 'rate' or 'power' systems, or by a mixture of these, has 
already been described in Chap. II. It will be noted that scarcely any 
tests are available much above the 14-year level, and that very few 
deal with subjects outside the 3 R’s. One obvious reason for this is 
that secondary school curricula vary so widely, and that only rela- 
tively small numbers study a foreign language or a science, etc., by 
comparable methods at comparable levels. Nevertheless it would not 
be difficult to devise and standardize tests in a number of subjects 
suitable for some fairly homogeneous groups of pupils or students, 
such as those in modern or in grammar schools, in teacher training 
colleges, or provincial universities. Certainly our tests lag far behind 
those available for similar purposes in America. 


INTELLIGENCE TESTS 
Intelligence tests closely resemble quantitative educational tests 
(to which they are, of course, historically prior) in their construction 
and scoring. Thus they may consist of: 
(a) A series of tasks graded in difficulty (power plan). 
(b) Tasks such as performance tests where the speed or number 
of moves are recorded (rate plan). 
(c) Sets of questions which are roughly graded, but where the 
score depends on the number answered within a certain time limit. 

Group tests generally include a ‘battery’ of some half-dozen (be- 
tween about four and twelve) 'sub-tests,' each containing some twenty 
(between ten and fifty) questions or items. Alternatively they are 
arranged in ‘omnibus’ form, where items of all kinds are mixed up. 
In the battery type, each sub-test has its own instructions and sample 
items, and is timed separately. In the omnibus form, the instructions 
and samples may be printed along with the questions, or included 
in a preliminary practice sheet, and one time limit suffices for the 
whole test. An omnibus test is therefore somewhat easier to give, but 
a test battery has the advantage of providing rest pauses every few 
minutes. 

Almost all intelligence tests thus consist of rather large numbers 
of items. Performance tests are exceptional, but it is desirable to 
apply several, say six or more, of these. The tests are always applied 
and scored in a standard manner, and the scores either take the 
form of Mental Age years (as in the Binet and some other tests), or 
percentiles or standard score I.Q.s (cf. p. 67). 

The Innateness of Intelligence.—Many of the standard textbooks 
describe intelligence as a general innate capacity underlying all our 
abilities, dependent on the genes that we inherit, and therefore fairly 
constant throughout life. Thus intelligence tests differ, it is said, 
from attainments tests in that they consist of tasks whose solution 
depends upon native ability rather than upon acquired experience 
or upbringing. This view, however, seems rather unrealistic. For 
although there is ample evidence of the existence of inherited 
differences in intellectual potentiality, we are quite unable to observe 
or measure such a capacity. The good or poor intelligence which we 
observe in everyday life, at school or at work, and which we try to 
measure by our tests, is the product of innate factors and environ- 
ment. As Hebb (1948) and Piaget (1950) have shown, the young 
child gradually builds up or acquires his perceptions and habits and 
his thinking through contacts with his physical and social environ- 
ment; and the level that he reaches at any age is affected considerably 
by the kind and amount of intellectual stimulation that the environ- 
ment has provided up to that age. This level is fairly stable under 
normal conditions of upbringing—more so than would be gathered 
from the publications of some left-wing theorists and of some 
extremely environmentalistic American investigators. But it does 
fluctuate more than would be expected from the statements of some 
of the early pioneers of mental testing. For example, over the prim- 
ary school age range from 6 to 11, or over the range from 11 to 18 
or so, the correlation between two similar tests is very far from per- 
fect, but it does not sink below +0.70. According to the Probable 
Error of estimate technique (p. 127), this means that the average or 
median child alters by about 7 I.Q. points up or down on retesting. 
Many are more stable than this, but a very few show much larger 
gains or losses of 30 or more points (4 × P.E.); and 17% are liable 
to alter 15 or more points either way.¹ At the same time the stability 
is great enough to allow us to make useful educational or vocational 
predictions, for the majority of children, over several years. Par- 
ticularly striking evidence of this fact is provided by Terman and 
Oden's (1947) follow-up of children with very high I.Q.s over 25 
years. A fuller discussion of these controversial questions, and of 
the psychological nature of intelligence, may be found elsewhere 
(Vernon, 1960). 
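
The figures quoted above can be checked with a short calculation. The sketch below is only illustrative: it assumes an I.Q. standard deviation of 15 and the usual formula for the Probable Error of estimate, P.E. = 0.6745 × S.D. × √(1 − r²), together with a normal distribution of departures from the predicted retest score. The function names are hypothetical and are not taken from the text.

```python
import math


def probable_error_of_estimate(sd, r):
    """Probable Error of estimate: 0.6745 * SD * sqrt(1 - r^2)."""
    return 0.6745 * sd * math.sqrt(1.0 - r * r)


def proportion_altering_by(points, sd, r):
    """Two-tailed normal proportion of testees whose retest score departs
    from the predicted score by at least `points` in either direction."""
    standard_error = sd * math.sqrt(1.0 - r * r)
    z = points / standard_error
    return math.erfc(z / math.sqrt(2.0))


SD = 15.0  # assumed standard deviation of I.Q.s
R = 0.70   # lowest retest correlation mentioned in the text

pe = probable_error_of_estimate(SD, R)
share = proportion_altering_by(15, SD, R)

print(f"P.E. of estimate: about {pe:.1f} I.Q. points")
print(f"4 x P.E.: about {4 * pe:.0f} points")
print(f"Share altering 15 or more points: about {share:.0%}")
```

Under these assumptions the P.E. comes to a little over 7 points and 4 × P.E. to nearly 30, in line with the text; the share altering by 15 points or more comes out at roughly one child in six, close to the 17% quoted.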

Thus a better way of looking at intelligence and attainments is to 
say that the former refers to the more general qualities of thinking— 
comprehension, level of concept development, reasoning and grasp- 
ing relations—qualities which seem to be acquired largely in the 
course of normal development without specific tuition; whereas the 
latter refer more to knowledge and skills which are directly trained, 
and whose acquisition and retention depend more on the person's 
interests and on his personality traits such as industriousness. There 
is no sharp dividing line between them. Indeed some items such as 
vocabulary are to be found either in intelligence or in English attain- 
ment tests. Again some of the items in the Binet and Wechsler 
scales, and in such well-known group tests as Army Alpha, are 
based on general knowledge or information. Although such know- 
ledge is obviously acquired, these work very well as measures of 
intelligence, since the more intelligent are more likely to pick it up, 
in or out of school, and to retain it.² I.Q. and E.Q. always show very 
close overlapping, though some children do come out markedly 
better on one than the other; and investigations such as Burt's (1937) 
and Schonell’s (1942) into educational backwardness reveal that 
those with relatively low attainment are usually handicapped by 
bad home conditions, poor health, ineffective teaching, or emotional 
maladjustment, or a combination of these. Thus even though we must 
admit that intelligence and the I.Q. are only partly innately deter- 
mined, it is still true that they give better indications of capacity for 
acquiring further education than do present attainments alone. 
Intelligence tests are more fair to children from remote rural 
schools, and others whose schooling has been affected, e.g. by 
absences, and to children of inferior background; although all these 
factors themselves do have adverse effects on the I.Q. 

¹ With repeated retestings variations are greater, and they may be increased 
by numerous factors such as above average ability, standard deviations greater 
than 15, inaccurate norms, or practice and coaching effects. They are much 
larger, also, if the first test is given before the age of 6. Indeed so-called develop- 
mental quotients obtained from tests of 0 to 24-year-olds have virtually no 
predictive value for later I.Q. 

² A more serious criticism is that girls and women are 
poorer in general knowledge than boys and men; and it is preferable to use items 
for intelligence tests which show no large sex differences. 

The majority of tests employ the verbal medium since man's 
higher thinking and reasoning capacities generally involve the use 
of words, numbers, or other symbols. Moreover one of the most 
important, if not the most important, use of intelligence tests is that 
of predicting scholastic aptitude, and as school work is itself pre- 
dominantly verbal, verbal tests are certainly the most useful. But 
this means that people's performance at such tests is considerably 
affected if they have had inadequate opportunities for acquiring 
language, for example the foreign-born or bilingual, the deaf, or 
children of gipsies or canal bargees who receive little or no schooling. 
Many testers therefore prefer non-verbal tests, consisting of prob- 
lems based on pictures, abstract diagrams or concrete (performance) 
material. While these may be more fair to the verbally handicapped, 
they cannot be considered as giving a truer measure of innate 
ability. A British or American child, for example, has much more 
opportunity than an African child for acquiring skills with pictures 
and blocks, and a middle-class British child probably has advantages 
over a lower working-class child. Nadel (1939) and Biesheuvel (1949) 
show convincingly the fallacies of assuming that racial differences in 
intelligence can be measured by such tests, or indeed that any type 
of test is ‘culture free.’ Tests of the concrete or pictorial type are 
valuable not only for the deaf and the illiterate, but also for younger 
children who are less interested in verbal tests and less accustomed 
to verbal tasks than older persons. But these, and other non-verbal, 
tests give decidedly poorer predictions of educational capacity, 
except possibly in the technical, scientific, and mathematical fields. 

Individual Tests.—Up to 1937, the majority of testers of individual 
children used the Stanford Revision of the Binet-Simon scale, pub- 
lished by Terman in 1916, or Burt’s unpublished restandardization 
of this Revision. The New Stanford Revision, or Terman-Merrill 
scale then became the most popular version (Terman and Merrill, 
1937). This is superior both because it contains very full instructions 
for application and scoring, which help to obviate the confusions 
and errors previously so common among inexperienced testers, and 
because of its greater extensiveness. It is issued in two forms, L and 
M, both of which cover almost the whole range of intelligence from 
that of 14-year children to that of adults. At the same time it has its 
difficulties and weaknesses: 

(a) The original American versions cannot be applied as they 
stand in this country, because they are replete with words and phrases 
that require ‘translation’ into English. Some of these have been 
altered in the edition published here, but many other desirable 
modifications have been suggested. Of the latter, the simplest and 
the most accessible are those prepared by the Scottish Council for 
Research in Education (Kennedy-Fraser, 1945). Again the typical 
responses quoted in the scoring instructions are of little use since, 
even after translation, they represent the thoughts and expressions 
of American children. The tester must try to decide whether his 
testees’ responses correspond in intellectual level to these accept- 
able and unacceptable responses. 

(b) None of the British versions has been restandardized. Thus 
the tester must expect to find that many of the tests are misplaced in 
order of difficulty. Children may fail all the tests in one year, and 
yet pass one or two in a higher year. The same applies to items within 
a test. Thus the Vocabulary words and some of the Absurdities, 
etc., require rearrangement. 


(c) It is difficult to obtain toys and other practical material, 
needed for tests of younger children, which are precisely comparable 
to the originals, and this material is very expensive. Fortunately it is 
used only at Mental Ages of 4 and less. For almost all children of 
primary school age, the printed cards and beads and blocks are 
sufficient. 

(d) Most of the tests near the top end of the scale, labelled Aver- 
age or Superior Adult (M.A.s 15-22), are too childish in content to 
be suitable for adults, although they are useful for discriminating 
among children of above average and very superior intelligence. 

(e) Despite the care taken over the American standardization, the 
tests from about 11 years onwards appear to be considerably too 
easy, and testers complain that far too many adolescents are getting 
high I.Q.s in the 130-160 range. It may be that these queer results 
arise merely from variations in the Standard Deviation of I.Q.s at 
different age levels (cf. p. 71); and Fraser Roberts and Mellone 
(1952) supply a useful correction table which allows for such varia- 
tions. But, as pointed out earlier (p. 73), no test based on M.A. 
scoring can be expected to yield really satisfactory results. 

The Wechsler intelligence scales surmount several of these 
difficulties. The original Wechsler-Bellevue test (Wechsler, 1939) is 
now replaced by W.A.I.S. (Adult Intelligence Scale), which has 
norms from 17 to 75 years.¹ The W.I.S.C. (Intelligence Scale for 
Children) is applicable from 5 to 15+ years, but is not—like 
Terman-Merrill—suitable for pre-school children. Each scale in- 
cludes 5 or 6 verbal sub-tests and 5 performance tests, and thus yields 
a verbal and a performance, as well as a combined, I.Q. Here too 
some adaptations are needed for British testees, but standard modi- 
fications have been prepared by the British Psychological Society. 
The pictures, blocks, etc., needed for the performance tests are un- 
avoidably expensive, but the standardization and scoring are so much 
superior to those of any other scale that most psychologists con- 
cerned with individual testing of children and, especially, of adults 
prefer the Wechsler. 



Valentine's Intelligence Tests for Young Children (1948) are 
particularly suitable for 2- to 5-year-olds, though also providing a 
short battery of individual tests for children up to about 11 years. 
This scale has the advantage that all the necessary materials are 
included in the handbook, or can easily be constructed by the tester. 
For 0-2-year-olds, Griffiths' (1954) Mental Development Scale is the 
most thorough, and the best standardized in Britain. It yields 
separate quotients in each of five fields of development—Locomotor, 
Personal-social, Hearing-speech, Eye and Hand, Performance, and 
a combined or general quotient. In testing at this level it is especially 
important that the tester be thoroughly trained and experienced in 
managing babies and using the correct materials and instructions. 

¹ The original version may, however, continue to be used for some time in this 
country, partly because more thorough English adaptations have been worked 
out, and partly because it normally takes only 45 minutes to apply as compared 
with over 60 for W.A.I.S. 

Performance Tests.—As we have seen, these practical tests may be 
used when verbal ones are inapplicable, or they may usefully supple- 
ment the predominantly verbal Binet-type test. Their chief merits are 
their greater attractiveness to children, and their capacity for bring- 
ing out a variety of temperamental reactions. A testee’s manner of 
approach to such tests, and his behaviour when confronted with 
difficulties, throws much light on his impulsiveness, persistence, 
complacency, and other qualities, which cannot as yet readily be 
tested or measured objectively. 

The majority of performance tests suffer from the practical draw- 
backs that they are very unwieldy and difficult to transport, and 
extremely costly. Some testers are suspicious of them because 
different tests often give very inconsistent results. A child with a 
Binet M.A. of 9 years may, for instance, pass performance tests at 
anywhere from the 7- to the 12-year levels. This is partly due to poor 
standardization, and to variations in the actual materials and 
methods of application. As they have to be applied individually, the 
task of providing accurate British age norms is a difficult one (cf. 
Vernon, 1937). Even if they were perfectly standardized, we should 
still expect considerable inconsistencies, since their inter-correlations 
show that each test involves a large specific factor, or unknown 
group factors. Many of them are also poor in reliability, and the 
evaluation of this feature is difficult because of the large practice 
effects if they are applied twice to the same testees. Thus the mental 
tester should never place much reliance on a single performance test, 
nor on two or three. A small number may be sufficient to indicate 
whether a Binet I.Q. is likely to have been distorted by some verbal 
defect, but the median or mean score from at least half a dozen, or 
the total result from a battery such as Drever-Collins, W.A.I.S. or 
W.I.S.C., should be taken if a reliable performance test M.A. or I.Q. 
is desired. 

Undoubtedly then their validity, either as measures of g or of 
‘practical ability,’ is poor. True, we have seen that they may involve 
the spatial ability factor, k, to some extent, and so be useful in 
selection for mechanical occupations and technical courses. But 
there is no real justification for the view that they give a better 
indication of intelligence for daily life purposes than the more 
reliable verbal type of test. It is interesting to note that several of 
them are markedly affected by emotional maladjustment and neuro- 
ticism. Children tested at a Psychological Clinic tend to make poorer 
scores than on Binet. Delinquents, on the other hand, who are often 
very backward educationally, do tend to do comparatively well on 
performance tests (cf. Earl, 1939). 


GROUP TESTS AND INDIVIDUAL TESTS 


The merits and defects both of individual tests such as Terman- 
Merrill, Wechsler and performance tests, and of group verbal or 
non-verbal tests, may best be brought out by contrasting them under 
the following series of headings: 

1. Age Range.—Individual tests must be used with very young 
children because it is practically impossible to hold the attention of 
a group, or to get them all to act alike at the same moment. Further, 
it is not natural for children to apply their intelligence to symbols on 
paper—either words, pictures or diagrams—until they have settled 
down to formal schooling. Manipulation of concrete objects or 
individual conversation with the tester constitute much more appro- 
priate testing media. Group tests are available which claim to 
measure Mental Ages down to 6 or even 5 years. The instructions for 
these must, of course, be given orally. But their value is very doubtful, 
for the above reasons. They may sometimes be useful for classifying 
pupils aged 7+, i.e. children whose M.A.s are likely to run from 
5 to 9. 

By 9 or, better, 10 years, children can read instructions for them- 
selves, are thoroughly accustomed to class work, and can readily 
deal with a variety of verbal problems. This is therefore a very suit- 
able age at which to begin group testing. Above 14, group tests have 
definite advantages over individual, both because they can readily 
be made difficult enough even for the highest intelligence levels, and 
because their items are much less apt to appear silly and childish to 
suspicious adolescents or adults. Nevertheless, the W.A.I.S. or 
W.I.S.C. are preferable for abnormal cases who would not settle 
down readily to the group test situation, e.g. the feeble-minded or 
patients in a mental hospital or clinic. 

2. Ease of Application.—Group tests are easier to procure, to 
apply and to score, and they demand less training and experience on 
the part of the tester. Possibly, however, these are disadvantages. 
We assume too readily that group test performance is unaffected by 
the tester and the atmosphere he creates. As shown in the next 
chapter, there are many errors of application and interpretation that 
inexperienced testers often commit. 

3. Time.—Needless to say, the time saved by applying group 
instead of individual tests is enormous. For instance, a class of forty 
can usually be tested in less than an hour, and their answers scored 
in four to eight hours, whereas an experienced Binet tester needs 
about half an hour for each young child, three-quarters for an older 
one, and an inexperienced tester much longer still. The point, how- 
ever, is whether this period per child is worthwhile. The majority of 
psychologists would certainly say that it is, and that an individual 
test can reveal much better than a group test things which the 
child’s teachers and parents may have failed to recognize even after 
years of acquaintanceship. Probably the best compromise would be 
for a school psychologist or infants’ mistress to test each child 
individually at 6, for a group non-verbal or oral test to be given in 
class at 8, and verbal group tests at 10 and 14 (assuming that the 
results of selection tests are available at 11+). Such results would be 
entered on a cumulative record card, and would provide a very 
reliable picture of each child’s intellectual development. Individual 
tests would need to be given from 7 onwards only to exceptional 
pupils whose group test results were highly irregular, or about whom 
some important decision had to be made for purposes of educational 
or vocational guidance. 
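
The scale of the saving can be seen by working through the figures just quoted for a class of forty. The sketch below is merely illustrative arithmetic: it takes "less than an hour" as an upper bound of one hour for the group sitting, and the variable names are hypothetical.

```python
CLASS_SIZE = 40

# Group test: one sitting for the whole class, then scoring.
group_sitting_hours = 1.0            # "less than an hour", taken as an upper bound
group_scoring_hours = (4.0, 8.0)     # lower and upper figures from the text

# Individual Binet test, experienced tester: time per child.
binet_hours_per_child = (0.5, 0.75)  # young child, older child

group_total = tuple(group_sitting_hours + s for s in group_scoring_hours)
individual_total = tuple(CLASS_SIZE * h for h in binet_hours_per_child)

print(f"Group testing:      {group_total[0]:.0f} to {group_total[1]:.0f} hours")
print(f"Individual testing: {individual_total[0]:.0f} to {individual_total[1]:.0f} hours")
```

On these figures the group test costs at most some 5 to 9 hours of the tester's time for the whole class, against 20 to 30 hours of individual testing by an experienced tester.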

4. Time Limits.—Most of the items in the Binet scale and many 
individual performance or educational tests are untimed, whereas 
the great majority of group tests have to be done under speeded 
conditions. Are not, the critics ask, some children slower but surer 
than those who score well on these timed tests? As indicated pre- 
viously (p. 132), the answer is more negative than positive, but the 
question is a complex one. If time allowances on speeded tests are 
increased, slower children certainly improve their scores, but they 
still generally remain poorer than quick ones. The rank order is but 
little affected and (according to the principles of Chaps. II and IV) 
it is relative, not absolute, scores that matter. At the same time experi- 
ments do show that with highly speeded and rather easy tests, 
children’s performance is considerably affected by their quickness 
and by whether they aim primarily at speed or at accuracy; whereas 
with very difficult, untimed material their persistence or willingness 
to go on trying may affect their results as much as does their ability. 
For most purposes it is desirable to compromise between these 
extremes, and to avoid mixing up our measures of ability with what 
are essentially temperamental or work attitude factors. Further, it 
would be administratively most inconvenient to use untimed tests, 
which different testees would take very different times to complete. 
Hence it is possible that some of our present group intelligence and 
attainments tests (particularly in arithmetic) are unduly speeded. If 
we inclined towards tests with longer time limits and more difficult 
items, very few children would alter their relative scores appreciably, 
but it is likely that our predictions of future educational achieve- 
ment would be slightly improved. We know, too, that emphasis on 
speed unduly handicaps older persons, say 30 years upwards, hence 
ample time limits are preferable among adults. It may be that it also 
upsets emotionally excitable or nervous children, and this is one 
reason why Clinic psychologists prefer to use individual tests where 
the testee gives his answer at his own pace. 

5. Objectivity.—With good group tests the conditions of 
application and the scoring are much better standardized than with 
most individual tests. Very little is left to the personal decision of the 
tester, and the marking is, or should be, completely foolproof. In 
actual practice many group testers still manage to commit gross 
errors, some of which will be pointed out in the next chapter. And 
modern individual tests like the Terman-Merrill and Wechsler afford 
much less scope for subjective whims than did the older ones. 
Nevertheless individual testers often have to judge the correctness of 
doubtful responses, and may, without departing from the instructions 
in the handbook, alter the testing conditions in a variety of ways. Yet 
there seems to be no evidence of serious discrepancies between the 
results of trained testers, although untrained ones may make grave 
mistakes, even sometimes leading to the wrongful certification of 
mentally defective children or adults. Reliability coefficients over 
short periods are very high, though greater difficulties leading to 
lowered reliability occur in testing children below the age of 5, and 
among maladjusted children or adults (cf. Tulchin, 1934). The ex- 
perienced tester should, however, be able to recognize when his 
results are liable to be affected by extreme unco-operativeness or 
instability in the testee, and to treat them with due caution or to 
arrange for a re-test when desirable. 

Actually the rigid objectivity of the group test is distrusted by 
Clinic psychologists, since they know that the same physical situa- 
tion may mean very different things to different children. The 
flexibility of individual testing is to them a great advantage which, so 
far as is known, they do not abuse. 

6. Dependence on Health and Mood.—Both group and individual 
tests seem to be much less affected by these factors than might be 
anticipated. Experiments show little and often no difference between 
the scores of children when they are fatigued or ill, and when in good 
health, nor between those who are strongly motivated (e.g. by the 
promise of monetary rewards) and those who are merely encouraged 
in the ordinary way. Studies of the latter type sometimes show that 
testees who are making more effort may attempt more test items, but 
they also make more mistakes so that their scores are scarcely 
altered. There seems then to be little evidence for such statements as 
that the examination conditions of group testing make some testees 
unduly nervous, or that others are stimulated to do better by the 
competitive nature of the testing situation. All such conclusions, 
however, are based on average tendencies, which admit occasional 
individual exceptions, and Binet testing certainly indicates that a few 
testees are so resistant or timid that they fail to do themselves justice. 
More extreme differences in attitude to the test can undoubtedly 
distort the scores, and certain physical conditions such as bad eye- 
sight must have a like effect. No intelligence test possesses perfect 
reliability; and conditions such as these may be at least partly 
responsible. Variations in group test performance, as well as in 
individual, are known to be greater among maladjusted children. Yet 
at the same time there is no evidence that excitement or anxiety 
upsets the results of 11+ selection tests among more than a small 
minority. 

The great advantage of individual over group testing is that in the 
former the tester can generally discern the influence of abnormal 
conditions, whereas in the latter he has scarcely any control over, or 
insight into, them. It may be that while high scores on group tests are 
relatively stable and significant, low scores can arise from a variety 
of sources other than poor intelligence. However, a lot more careful 
research is needed in order to determine definitely the part played by 
such factors. 

7. Dependence on Practice or Coaching.—Most intelligence tests 
are published, hence anyone can procure them, coach testees on 
them, and ensure considerably increased scores. It is a most foolish 
thing to do, since a child who is made to appear unduly intelligent 
may also legitimately be expected to be capable of unduly difficult 
school work. But whenever examinations for secondary school 
entrance, or for other competitive purposes, include intelligence 
tests, teachers and parents who wish the children to do well will 
naturally be tempted. In individual testing it is fairly easy to recog- 
nize when children have been coached or practised, since their glib 
answers are very different from the answers of children who are 
puzzling out the tests for the first time. It is less easy to know what 
allowance to make. In group testing, on the other hand, one may 
suspect that scores are too high all round, though one can never be 
certain unless (as has actually occurred) a whole class answers 19 
test items out of 20 correctly, the twentieth being one where the 
teacher was mistaken. 

In order to prevent direct coaching, most Educational Authorities 
obtain tests which are freshly constructed and privately standardized 
every year by Moray House or other organizations. But copies of 
older, more or less parallel, versions are readily available, and there 
are large sales of booklets which contain test items very similar to 
those in the unpublished tests. Particularly at the 11+ stage, and in 
areas where the provision of grammar school places falls far below 
the demand, intensive coaching in schools, at home, or by outside 
tutors, is known to be widespread. Not only does this tend to distort 
the normal school curriculum (cf. p. 199), but also it produces severe 
strain and anxiety among many of the pupils. 

The topic has received considerable publicity, and numerous 
investigations have been carried out. These show that coaching and 
practice on intelligence test materials do make an appreciable 
difference at the selection borderline, though they are nothing like as 
effective as parents and many teachers imagine. Here we should 
distinguish practice in taking complete intelligence tests, without 
further instruction or knowledge of the right answers, from coaching, 
where items are explained and hints given. A single previous practice 
test raises the average I.Q. by about 5 points, and further practices 
yield diminishing returns up to a total of about 10 points. After 
some five tests there is no further improvement, and I.Q.s fluctuate
irregularly or even begin to decline. When combined with coaching, 
the effects are somewhat larger, the average maximum rise being 12 
to 15 points. But these figures refer to children or adults who have 
never seen a group intelligence test before. When, as is usually the 
case nowadays, they are fairly familiar with tests, the effects may be 
only half as great. Moreover it has been shown that the effects are 
highly specific. Thus the kind of coaching which is given from pub- 
lished books of miscellaneous items, and which does not include the 
taking of a complete parallel test under examination conditions, is
remarkably ineffective, producing barely a 5-point rise. Again, the
maximum improvement occurs remarkably quickly; if practice or 
coaching amount to more than a few hours they seem to defeat their 
own ends. 

We have admitted earlier that good schooling and home back- 
ground do genuinely stimulate the development of intelligence. But 
it is clear, from the facts just stated, that intelligence cannot be 
‘taught’ in at all the same sense as school subjects are taught. Prac- 
tice and coaching merely provide a limited facility in dealing with the 
rather roundabout and artificial types of items that occur in most 
group tests, in understanding and following the instructions quickly, 
and in the concentration of effort. Thus they have distinctly less 
effect on the more natural types of items used in individual tests, 
where the testee supplies his own answer in his own words, than they
do on most multiple-choice or recognition-type group test items. 

In areas where coaching is at all common, the only fair solution 
seems to be to allow a limited amount of practice and instruction on 
parallel tests in the schools. This will give uncoached children a 
reasonable familiarity with the sort of tests they are to take, and will 
generally bring them to the level of the more intensively coached. 
But it would be better still if tests were used primarily for guidance 
rather than for selection, since then the incentive to coach would be 
removed. At the same time, the mental tester can never afford to 
ignore practice or coaching; the amount and kind of previous ex- 
perience of tests that testees have had does make a difference. For 
example, in any experimental investigation involving the comparison 
of two or more tests, the scores on those taken later will be to some 
extent affected. The educational psychologist should realize that 
published test norms may be inapplicable to children who are 
unduly ‘test-sophisticated,’ and should usually avoid giving tests in 
any school at intervals of less than, say, two years. 

8. Validity.—The various versions of the Binet test, as already 
mentioned, measure a poorly analysed hotch-potch of abilities, and 
they have been severely criticized by Spearman, Cattell and others on 
account of the weaknesses of their statistical construction and 
theoretical foundations. Burt (1939b) likewise points out the heterogeneity
of item content. Perhaps still more serious are the variations
in content at different age levels; for example, the verbal bias in 
Terman-Merrill becomes much more pronounced at the top end. 
Thus the scale cannot be said to be measuring the same intelligence,
or even the same pattern of abilities, throughout. A further consequence
of this heterogeneity is that it tempts testers to make dubious
subjective analyses of types of items, as when they note that a child 
does especially well on the more practical items, or badly on the 
memory items, and jump to the conclusion that he is a ‘practical’ 
child in daily life, or has poor ‘memory’ for school work. The 
experienced clinical psychologist can obtain many suggestive hints 
as to qualitative features of the child’s intellect and temperament 
from his responses to particular items, or groups of items, but they 
are only hints which require to be followed up in far greater detail 
before they can be regarded as of any diagnostic value. Qua test, it
only yields the one general score, which can be regarded as rather a
vague mixture of g, v:ed, and other factors.

In contrast, the well-constructed group test is highly homo- 
geneous, each of its sub-tests or types of items involving much the 
same factor pattern. Spearman would have claimed that it was an 
almost pure measure of g, though we realize now that the content 
and form of the items, together with difficulty level and timing 
conditions, do introduce other factors. It seems paradoxical, then, 
that most psychologists prefer the relatively unscientific Terman- 
Merrill test when they are called on to make any decision involving 
the intelligence of an individual child. They regard it as their most 
valid mental measuring instrument, and as giving the best available 
index of a child's general intelligence. That less trust can be put in 
group test scores is shown by the fact that different group tests 
often give somewhat discrepant results. If they are fairly similar in 
make-up the correlation between them (over a few days or weeks) is
likely to be between 0.8 and 0.9 in a representative group. But with
greater diversity, as between verbal and non-verbal tests, the correlation
may sink to 0.7 or below, and discrepancies between a child’s
two I.Q.s may reach 30 points or more, though the average or median
difference should still not exceed about 7 points. At the same time,
group verbal tests actually give at least as good predictions of future
educational standing as do the Terman-Merrill or Wechsler, among
normal children aged 10 upwards and adults. The superior validity
of individual tests is thus partly illusory, though it probably does
hold good for younger children, particularly in regard to intelligence
outside the school. 
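
The relation between the correlation of two tests and the size of the discrepancies to be expected can be illustrated by a rough calculation. The sketch below rests on assumptions not stated in the text: that both tests yield I.Q.s with a standard deviation of 15 points, and that pairs of scores are normally distributed. The function name is purely illustrative.

```python
from math import sqrt, pi
from statistics import NormalDist

def iq_discrepancy(r, sd=15.0):
    """Typical absolute difference between two I.Q.s that correlate r.

    Assumes both tests have the same standard deviation `sd` and that
    the scores are bivariate normal (an illustrative assumption).
    """
    sd_diff = sd * sqrt(2.0 * (1.0 - r))               # S.D. of the difference
    median_abs = NormalDist().inv_cdf(0.75) * sd_diff  # median absolute difference
    mean_abs = sqrt(2.0 / pi) * sd_diff                # mean absolute difference
    return sd_diff, median_abs, mean_abs

for r in (0.9, 0.8, 0.7):
    sd_diff, median_abs, mean_abs = iq_discrepancy(r)
    print(f"r = {r:.1f}: S.D. of difference {sd_diff:4.1f}, "
          f"median {median_abs:4.1f}, mean {mean_abs:4.1f}")
```

Under these assumptions a correlation of 0.7 gives a median discrepancy of roughly 8 points, while a difference of 30 points lies about 2.6 standard deviations out, so that it should be rare but not unknown in a large group.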


The probable explanation of this situation is that the intelligence 
which we recognize in daily life, in school or out of it, is not pure g,
but a mixture of abilities. Binet and his followers chose items largely
from children’s everyday behaviour and thinking, regardless of their 
factor content, and the very impurities that they included happen to 
overlap very well with the mixture we want. In the group test, how- 
ever, the items are mostly of a more restricted and artificial type, 
though they happen to coincide fairly closely with the kind of in- 
telligence required for passing examinations. The child is forced to 
express his abilities through a less natural medium, and this intro- 
duces such components as facility-in-doing-multiple-choice-items- 
at-speed, which are irrelevant to anything we want to predict. 

A further consideration, especially among younger children and 
among the more unstable or abnormal testees of any age, is that the 
group test yields only a numerical score, with none of the additional 
indications which are so useful in interpreting individual test re- 
sults. The Binet tester has much more effective control over the 
testee's state of mind. He ensures favourable co-operation, and can 
judge, albeit subjectively, whether his responses are typical of the 
best that he can do. Despite the small effects of mood, health, etc., 
on group test scores, there are likely to be many uncontrolled factors 
which upset these scores in individual instances, and which partly 
account for the imperfect correlations between different tests. It is 
artificial, again, to separate off intelligence as a purely cognitive 
variable from the rest of the personality. Normally it always operates 
in a certain emotional and social context. The group tester tries to 
ignore this; but the individual tester, by observing the child's man- 
ner of approach to difficulties (particularly those offered by per- 
formance tests), from the kind of test response and from general 
conversation, can build up a picture of the child's whole personality, 
and see how that personality applies his available intelligence. Thus 
he obtains a much more complete and practically useful view than 
can be got from a more scientific measure of ability factors, though 
admittedly this view is liable to subjective distortions. 

The present writer would hold that the most reliable testing of 
individuals must be done individually. But instead of one scale of 
somewhat indeterminate and variable content, he would prefer to 
see developed a series of shorter and more distinctive scales or sub- 
tests. Each sub-test should be shown, by factorial studies, to measure 
as homogeneous an ability as possible, if necessary over a limited age 
range only. Such abilities need not be pure factors. Indeed overlapping
is unavoidable, as they would all involve g, and possibly
other factors, in common. But they should cover as wide a range of
mental functions as possible, and so enable us to carry out more 
penetrating and more scientific differential diagnoses. Instead of 
merely summing the scores to provide a measure of general intelligence,
we should combine the results of several scales with appropriate
weights to measure, say, intelligence in everyday life situations;
a different combination would give aptitude for academic education, 
another for technical education, yet another for mental deterioration 
in different types of psychotic and brain-injured patients, and so on. 
Such weightings would be based on actual follow-up research, not 
on subjective judgments of what the scales measured. The Wechsler 
tests already go some way in this direction, but their sub-tests tend 
to overlap too highly, and they are mostly too unreliable to yield 
useful differential diagnoses. They give a good verbal I.Q. and a 
total (verbal + performance) I.Q.; but attempts to use patterns of 
scores, or weighted scores, on the different sub-tests for revealing 
various types of mental abnormality, seem to have broken down. A 
number of batteries of so-called differential aptitude tests have been 
published in America, descriptions of which are given by Anastasi 
(1954). Though some of these give promise of being more useful in 
vocational guidance than a single general test, they too suffer from 
insufficient reliability and too great overlapping. Thus the develop- 
ment of a really satisfactory set of tests, beyond the stage that 
Wechsler has reached, is by no means easy. 
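
The proposal made above, to combine several homogeneous scales with different weights for different purposes, can be sketched as follows. Every scale name, score and weight here is invented for illustration; in practice the weights would have to be derived from follow-up research, not chosen by judgment.

```python
def weighted_composite(scores, weights):
    """Weighted mean of standard scores on the scales named in `weights`."""
    total_weight = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total_weight

# Hypothetical standard scores (mean 100) of one child on four short scales.
scores = {"verbal": 112, "number": 104, "spatial": 95, "memory": 101}

# Hypothetical weightings for two different predictions.
academic = {"verbal": 0.5, "number": 0.3, "spatial": 0.05, "memory": 0.15}
technical = {"verbal": 0.2, "number": 0.3, "spatial": 0.4, "memory": 0.1}

print("academic aptitude :", round(weighted_composite(scores, academic), 1))
print("technical aptitude:", round(weighted_composite(scores, technical), 1))
```

The same set of sub-test scores thus yields a different index for each purpose, which is the point of replacing a single summed score by separately weighted combinations.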


LISTS OF TESTS AND METHODS OF MEASUREMENT 
IN EDUCATIONAL PSYCHOLOGY 


The following list includes most of the mental tests available in 
Britain at the time of writing, together with some particulars of age 
ranges, time requirements, publishers, prices, etc. A number of older 
tests, little used nowadays or out of print, are excluded; so also are 
most tests whose sales are restricted to Directors of Education or 
other specially qualified persons. A few tests published in Australia 
or in U.S.A. are included which seem likely to be useful to British 
psychologists and teachers. These and other tests from such countries 
can generally be ordered through the National Foundation for
Educational Research in England and Wales (referred to in the list
as NFER).

Publishers.—Publishers’ names are abbreviated as follows:


AE      Australian Council for Educational Research
B       Baird Scientific Instrument Co., Edinburgh
CL      Crosby Lockwood
G       Gibson, Glasgow
H       Harrap
ICP     Institute of Child Psychology
IPAT    Institute for Personality and Ability Testing¹
L       H. K. Lewis, London
LG      Longmans, Green
M       Methuen
NFER    National Foundation for Educational Research
NI      National Institute of Industrial Psychology
N       Nelson
OB      Oliver and Boyd, Edinburgh
PC      Psychological Corporation, New York
SRA     Science Research Associates, Chicago
U       University of London Press


Prices.—The prices quoted are those current in 1955 for 100 copies 
(including purchase tax), unless otherwise indicated. Naturally these 
are liable to fluctuation. Group tests are often available in dozens or 
sets of 25, at slightly increased relative cost. Single specimen copies 
with manuals are usually issued at about 2/- per test. An asterisk (*) 
in the prices column indicates that the answers can be written on 
blank paper or special answer sheets, or given orally, so that the 
printed blanks can be used over again. Where no price is quoted,
the necessary material is given in the book referred to.

Times.—When a definite figure for time limit is given, this usually
means the actual working time, apart from instructions, etc. An
approximate figure, e.g. 20-30, or c. 25, shows the number of minutes
including instructions, or indicates that the time varies with different
testees.

¹ Agent, Mrs. N. M. Barnes, The Grammar School, Henley-on-Thames.

Age Ranges.—These are given in two columns headed C.A., and
M.A. The former indicates that percentile, standard score or
other norms are available for all children in this range, whereas the
latter implies that mean (or median) test scores only are listed for
certain ages. Thus a test with C.A. 10-12 covers a wider range than
one with M.A. 8-14.4, since the latter will not discriminate below
I.Q. 80 at 10.0 years, nor above I.Q. 120 at 12.0 years.
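
The arithmetic behind this comparison can be set out in a few lines. The sketch assumes the ratio definition I.Q. = 100 × M.A./C.A. and reads the M.A. range as 8 to 14.4 years; the function name is illustrative only.

```python
def iq_limits(ma_low, ma_high, ca):
    """I.Q. range that mean-score (M.A.) norms can express at a given C.A."""
    return 100 * ma_low / ca, 100 * ma_high / ca

# M.A. norms running from 8 to 14.4 years, applied to 10- and 12-year-olds.
for ca in (10.0, 12.0):
    low, high = iq_limits(8.0, 14.4, ca)
    print(f"C.A. {ca:4.1f}: I.Q. {low:.0f} to {high:.0f}")
```

At 10 years the floor is I.Q. 80, and at 12 years the ceiling is I.Q. 120, which are the two limits quoted above.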

References to books or articles where descriptions of, or norms 
for, the test may be found are generally shown, by date, after the 
author's name. Some additional references and notes are given in 
small type. Readers who require a detailed critical evaluation of any 
American published test should consult the latest edition of Buros's 
(1953) Mental Measurements Yearbook. This, however, covers only 
a proportion of British and Australian tests.


CHAPTER X
HINTS TO TESTERS


1. Choice of a Test.—In choosing a test for some particular purpose, 
the tester's first consideration should, of course, be its validity for 
that purpose. As we have seen in Chap. IX, he may be able to judge 
its validity to some extent by inspection of its content, but he would 
be better advised to seek out experimental proof of what it is that 
the test measures. Investigations by its author, or by others who have 
used the test, should be consulted. The mere fact that it is called a 
test of such-and-such an ability does not constitute evidence. The 
second consideration (which greatly affects the first) is the test's 
reliability. Generally speaking, a test must be fairly long, and its 
scoring must be objective, if it is to be satisfactory in this respect.
Most authors of good tests publish evidence of their reliability in the
accompanying manuals. Little trust should be placed in tests whose
scores are not known to yield a reliability coefficient of 0.90 or over
in a representative group—say, a complete year-group of children. 

2. Age Range.—Few, if any, tests presume to cover all ranges of 
ability, and most educational and intelligence tests are only intended 
to cover quite a limited range. Thus the tester should be careful to 
choose from among the available tests those which will, when applied 
to a large group of his testees, be likely to yield a normal distribution 
of scores, and which will not unduly curtail this distribution at either 
end. For instance, in testing an ordinary school class, he should first 
estimate roughly the range of M.A.s or E.A.s which it contains. A 
class of average C.A. 10 years may often include individuals with 
M.A.s (on a group test) of 6 to 14 years, though the great majority 
should lie between 7½ and 12½ years. He should then study the ages
for which the various tests provide norms, and pick one which covers 
as much of this predicted range as possible. If he has reason to believe 
that the average M.A. or E.A. of the class will be, perhaps, a year 
above or below its average C.A., he must correspondingly adjust his 
choice of test. 

The ‘differentiating power’ of the test norms should also be noted. 
They should show a fairly large and even increase in score from one
age level to the next. Suppose that a test containing, say, 200 items
has the following norms from 6 to 13 years. The increases are large 
up to 9 or 10, but then tail off, till there is a difference of only 5 items 
between the 12- and 13-year scores. Such a test will clearly not dis- 
criminate or differentiate well at the upper end, and its effective range 
of norms is from 6 to 11 rather than from 6 to 13.


Years    6    7    8    9   10   11   12   13
Score   40   71  101  128  151  168  179  184


Similar considerations apply to the choice of tests whose norms 
are issued in percentile, or other, forms. 
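
The differentiating power of the age norms tabulated above (scores of 40 to 184 for ages 6 to 13) can be checked by computing the gain from each age level to the next. This is a minimal sketch; the function name is illustrative.

```python
ages = [6, 7, 8, 9, 10, 11, 12, 13]
norms = [40, 71, 101, 128, 151, 168, 179, 184]

def yearly_gains(ages, norms):
    """Increase in the norm score from each age level to the next."""
    return {(a, b): upper - lower
            for a, b, lower, upper in zip(ages, ages[1:], norms, norms[1:])}

gains = yearly_gains(ages, norms)
for (a, b), gain in gains.items():
    print(f"{a:2d} to {b:2d} years: +{gain}")
```

The gain falls from about 30 items a year at the lower ages to only 5 between 12 and 13, which is why the effective range of such norms stops short of the nominal one.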

3. Convenience of Arrangement and Scoring.—If a test is to be 
used extensively, attention should be paid to its general arrangement 
from the point of view of ease in applying and scoring. With some 
performance tests and tests of special aptitudes, much time is wasted 
in setting out the material, or the instructions are very elaborate, or 
the scoring unnecessarily detailed. More modern American tests 
usually try to reduce these inconveniences. 

Group tests of attainment or intelligence may also vary greatly in 
‘mechanics’ or ‘lay-out.’ Some of the older ones have insufficient 
instructions and sample items or practice sheets, so that testees often 
fail to understand what they have to do, and a lot of additional oral 
explanation is needed. Sometimes the testees have to answer in their 
own words, and the scoring manual never allows for all the possible 
answers which turn up, so that much time is wasted in deciding 
whether or not an answer shall be counted. The positions of the 
answers are often scattered all over the page, instead of being ar- 
ranged in the right-hand margin. True, stencils are often provided 
which, when laid over the test page, show up the correct answers.
But, according to the present writer’s experience, it is far more 
trouble to fit the stencils over each successive page than it is to com- 
pare the testee’s booklet with a booklet on which all the correct 
answers are clearly marked. 

Many American group tests are now issued with an interleaved 
carbon-paper device, so arranged that all the correct answers, and 
none of the incorrect ones, are automatically shown as crosses on 
the back of the page. The scorer then merely has to count the num- 
ber of crosses. Alternatively, testees may check their answers with a 
special carbon pencil; their blanks are then passed through an 
electrical machine which counts instantaneously the number of
carbon marks in correct positions, through the amount of current
which passes. In this country, however, testing is seldom carried out 
on so large a scale as to warrant such ingenuity, and its consequent 
expense. 

4. Procedure.—Many of the following hints may seem puerile to 
some readers, yet they are all based on the writer's experience of
mistakes actually committed by amateur testers. Such mistakes may
often lead to quite serious consequences.

The tester should first study the instructions and the test material 
carefully with a view to seeing precisely how the test works, and 
should then follow these instructions to the letter. Often it is worth 
the tester’s while to take the test himself. In Binet testing, the results 
should not be trusted until such time as the tester practically knows 
the instructions, and the essentials of scoring, by heart, and until he 
has given the test to, say, twenty cases. 

5. Modification of Instructions or Materials.—The inexperienced 
tester is very apt to find flaws in the test or instructions which, he 
thinks, could be eliminated by some simple modifications. Very 
likely the test could be improved, but then it would no longer be the 
same test, and the published norms would no longer hold good. For 
the same reason a printed group test must not be duplicated or 
mimeographed in order to save expense. Quite apart from the 
probable infringement of copyright, the difficulty of the test is
likely to be altered thereby. 

6. Timing.—Accurate timing is essential. In some group tests a
mistake of 5 seconds may quite possibly lead to an error of 1% or
2% in average I.Q. Testers who do not own stop-watches are advised
to use omnibus tests when feasible, since these require only a single 
timing. If times are kept by a watch with a seconds hand, the exact 
minute and second at which a test is to stop should be written down
as soon as the test has started. Mistakes very easily occur if the 
tester trusts to his memory. 

7. Underlining and Erasing.—Before giving a group test to a class, 
all rulers and indiarubbers should be firmly removed, otherwise the 
pupils will waste several minutes in ruling their underlinings, or
rubbing out responses when they change their minds. If the test
instructions do not already say so, the tester should explain to the
pupils that, in order to change a response, they should scribble out
the first choice rapidly, and then mark or underline the new choice
clearly.


8. Distractions and Copying.—In all testing, possible sources of
distraction should be avoided. For instance, a group test should not 
be given when a singing lesson is due to occur in the next classroom. 
The testees should be spaced as far apart as possible, since it is very 
easy, in multiple-choice tests, for one to overlook and copy the 
answers of another. Often the younger children have no intention of 
cheating, but are yet extremely anxious to write down the same as 
their neighbours, simply because the whole situation is unfamiliar to 
them, and they are afraid of doing the ‘wrong thing.’ 

9. Practice Sheets and Samples.—In most group tests for children 
there are preliminary practice sheets or sample items, where help 
may be given. The tester should make sure that even the dullest 
testees understand what to do, and that they can manage at least one 
or two of the sample items by themselves. 

10. Coaching.—No practice or coaching of any kind, beyond 
what is specified, should be given. Testees who gain an advantage by 
such means will no longer be comparable with those on whom the 
test was standardized, and the norms will be useless (cf. pp. 165-7).

11. The ‘Subjective’ Situation.—At the same time as adhering to 
the standard objective conditions of testing, the tester should try to 
obtain, even in a group test, a favourable subjective atmosphere. If 
the tester himself is apprehensive, or bored, his mood is likely to 
communicate itself to a group of children. The testees should not be 
excited and anxious, nor, of course, lackadaisical, in their attitude 
to the test. Teachers should not peer over children’s shoulders while 
they are working to see how they are getting on; it inevitably upsets 
them. Still less should they criticise, or help, children with wrong 
answers.

12. Testing and Teaching.—Many teachers make bad testers be- 
cause the teaching attitude is so ingrained in them that they cannot 
refrain from drawing morals, or pointing out mistakes. A test is not 
a kind of lesson, and children should not be expected to learn any- 
thing whatsoever from it. Rather it is a matter of stocktaking. The 
essential object is to obtain a record of the testees’ performances 
under certain standard conditions, so that these performances may 
be compared with those of other similar testees who have been 
tested under the same conditions. 

13. Scoring.—The scoring instructions must be followed as 
meticulously as the instructions for applying the test. The tester may 
sometimes disagree with the response provided in the scoring key,
and think that some alternative is just as good or better. This may be
so, but once personal opinion enters, the scoring ceases to be objective,
and the results again cannot be compared with the test
norms. The scoring of most group tests is quite mechanical, but 
errors very readily creep in, both in checking the right responses and 
in adding up correct checks. It is extremely desirable for such tests 
to be scored twice, preferably by different persons. 

14. The Subjective Situation in Individual Testing. —The conditions 
of individual testing require still more tact and sympathy than those 
of group testing. By means of preliminary conversation the child 
should be put at ease, and interspersed irrelevancies should keep him 
interested and responsive throughout. If he shows signs of fatigue, 
distraction, or resistance the testing should be broken off. The good 
tester develops a kind of ‘bedside manner,’ and avoids formalities.
He does not read the oral tests from the handbook, but, knowing 
them by heart, says them in a conversational tone. In order to keep 
the child interested he must himself sound interesting. Encourage- 
ment should follow every response, but some caution is needed in 
praising incorrect responses, since many children may realize when 
they are wrong and become careless or discouraged if there is no
criticism nor stimulus to do better. The presence of a third person, 
particularly the mother or head teacher, constitutes a distraction 
which should always be avoided. 

In giving the older versions of the Binet scale, it was the common 
practice among trained testers to jump about from one part of the 
scale to another. This enabled the tester to choose the test which 
seemed to him to be most appropriate to the child’s mood at the 
moment, also to save time by grouping together tests with identical 
instructions (e.g. the digit memory tests); and it prevented all the 
most difficult tests from coming at the end of the session. This procedure
has been forbidden by Terman and Merrill (1937). But Hutt
(1947) shows that it makes no difference to the I.Q.s of normal 
children and is, if anything, more fair to unstable children. The 
Terman-Merrill handbook gives several other useful hints on obtain- 
ing good rapport, while at the same time adhering strictly to the 
standard instructions and scoring.

15. Record Keeping.—The following principle is commonly 
adopted by experienced testers: never spend any of the child's time 
in making records of his responses. Notes should certainly be made, 
not only of test responses, but also of manner, speech, etc. All the
recording, however, should be done while the child is engaged either
in doing a test, in irrelevant conversation, or in listening to the 
instructions for another test; and it should be so unobtrusive that it 
never constitutes a distraction. Poor testers not only waste time in 
recording responses and looking up the scoring, but also allow 
awkward gaps to occur which entirely spoil the favourable atmo- 
sphere of alertness and interest. The reason for this is usually mere 
lack of familiarity with instructions and scoring. 

16. Impartiality.—While the good tester enters very fully into the 
thoughts and feelings of the child he is testing, yet he also needs to 
exert considerable self-control. There are various ways in which he 
may (quite unwittingly) give illegitimate hints, for instance, by over- 
emphasizing the crucial words in an oral test, or by making empathic 
movements in the direction of the correct blocks, holes, etc., of a 
performance test. Further, he cannot help but feel more sympathetic 
with some children than with others. He should, therefore, fre- 
quently ask himself: “Am I in any way being prejudiced by my 
liking for this child; am I giving hints or crediting him with doubtful 
passes because I believe that he is bright, which I would not do if I 
believed that he was dull?” 

17. Studying the Whole Child.—The tester should never regard 
his job merely as measurement of the I.Q. or other test scores 
(although many doctors, teachers, and others may seem to regard 
that as his sole function). Rather he should try to understand, and 
to build up a picture of, each child’s whole personality. By doing so 
his test results are more likely to be accurate, since he will realize 
when they are being distorted by emotional or other factors; and he
will certainly be better able to interpret their significance when the 
guidance or treatment of the child is under consideration. Although, 
as we have seen, he is not justified in analysing a heterogeneous test 
like the Binet scale into tests of hypothetical discrete abilities 
(memory, verbal, practical, etc.), yet he can pick up innumerable 
hints regarding the child’s intellectual and emotional characteristics, 
which can later be explored more thoroughly. Suggestions of specific 
educational difficulties may readily be followed up. Buehler (1938) 
shows that responses to the Ball and Field (or Purse and Field) test 
may give valuable diagnostic indications. Other Binet items and, still 
more, performance tests like the Porteus Mazes may be equally 
revealing (cf. Vernon, 1937; Porteus, 1952). These deductions should, 
of course, be checked later by comparing them with other sources of
information, such as the psychiatrist’s case-record, when available.
In this way a tester can always be extending his methods and improving
their validity.

18. Interpretation of Test Results: Unreliability.—No mental test 
score should ever be accepted at its face value, nor trusted in the same
way as physical measurements are trusted. Even the best tests, it
should be remembered, only measure to within a certain Probable
Error. In other words, a test score should be regarded as the centre
of a range of possible scores. It may be correct to within, say, 5%,
but also it may be in error by as much as 20%, unless it has been
confirmed by the results of other tests which narrow the range of 
uncertainty. 
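
The width of this range can be estimated from the reliability coefficient by the usual formulae: the standard error of measurement is S.D. × √(1 − r), and the Probable Error is 0.6745 times the standard error. The sketch below assumes an I.Q. scale with a standard deviation of 15 points, a figure chosen only for illustration.

```python
from math import sqrt

def error_of_measurement(reliability, sd=15.0):
    """Return (standard error, probable error) of an obtained score."""
    se = sd * sqrt(1.0 - reliability)
    pe = 0.6745 * se   # half of normally distributed errors fall within 1 P.E.
    return se, pe

for r in (0.95, 0.90, 0.80):
    se, pe = error_of_measurement(r)
    print(f"reliability {r:.2f}: S.E. {se:.1f}, P.E. {pe:.1f} points")
```

Even with a reliability of 0.90, therefore, the chances are only even that an obtained I.Q. lies within about 3 points of the true figure.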

19. Units of Measurement.—The imperfections of mental units 
should also be borne in mind, for example, the variations in size of 
M.A. and E.A. units (cf. pp. 70-1). The I.Q. is still more difficult for
teachers and laymen to interpret correctly. When derived from 
M.A./C.A. it is primarily a measure of rate of mental growth. But 
when it represents a standard score (as in Moray House, Wechsler, 
and other tests), it is, like a percentile, a measure of brightness
relative to other children of the same chronological age. In neither 
case is it an index of the amount of intelligence the child possesses 
at that moment; similarly the E.Q. is not a measure of present 
attainment. The M.A. or E.A., on the other hand, do provide such 
indices. The point may be brought out by the following example of 
the test scores of two children, A and B. 


A:    C.A. 9:0    M.A. 11:0    I.Q. 122
B:    C.A. 8:0    M.A. 10:6    I.Q. 131


Here it is A, not B, who is the more intelligent, since he can do more 
difficult intellectual tasks, and can be expected to be more advanced 
in school work. It is true, however, that B is the brighter relative to
8-year-olds than A is relative to 9-year-olds; and as B is growing 
more rapidly in mental powers, it is likely that he will catch up with 
A in another four years, and (if the test is reliable and valid) that he 
will, as an adult, be the more intelligent of the two. 
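
The arithmetic behind the example of A and B can be sketched as follows. This is only an illustration, assuming the ratio definition I.Q. = 100 x M.A./C.A. described above; the function names and the choice of months as the working unit are our own.

```python
def months(years: int, extra_months: int = 0) -> int:
    """Convert an age written as years:months into months."""
    return 12 * years + extra_months


def ratio_iq(mental_age: int, chronological_age: int) -> int:
    """Ratio I.Q. = 100 * M.A. / C.A., rounded to a whole number."""
    return round(100 * mental_age / chronological_age)


# Ages taken from the example in the text (C.A. and M.A. in years:months).
children = {
    "A": {"ca": months(9), "ma": months(11)},
    "B": {"ca": months(8), "ma": months(10, 6)},
}

for name, ages in children.items():
    iq = ratio_iq(ages["ma"], ages["ca"])
    print(f"{name}: M.A. {ages['ma']} months, I.Q. {iq}")
```

A has the higher M.A. (present level), while B has the higher I.Q. (rate of growth relative to age), which is the distinction the example is meant to bring out.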

In actual practical work of educational guidance or emotional 
treatment, the tester need hardly bother to calculate the I.Q. or 
E.Q., since what matters is the present level of the child's intelli-
gence and educational capacities. The only proper use of the I.Q. is
for predicting a testee’s future career, i.e. for assessing approximately
his ultimate educational or vocational level. 

20. Dependence of I.Q. on the Test Employed.—The naive tester or 
layman usually assumes that a child would obtain the same I.Q. 
whatever the test employed. This is so far from true that the test 
employed should invariably be quoted along with an M.A., E.A., 
I.Q., or E.Q. The reasons for these discrepancies have already been 
given, and will merely be summarized here. 

(a) Chance errors of measurement, due to the fact that no test is 
perfectly reliable (cf. pp. 125-9). 

(b) Many tests are not well standardized. For instance, American 
group tests may give results which are too lenient (pp. 66-8). 

(c) Different intelligence tests have different dispersions of M.A.s 
and I.Q.s, so that one test makes a bright child out to be much
brighter, a dull child to be much duller, than does another test 
(pp. 71-3). 

(d) The testees may have had more practice or coaching relevant 
to one type of test than to the other (pp. 165-7). 

(e) All tests are liable to distortion by unsuspected group factors,
or by emotional or other influences, about which we possess little 
exact knowledge, and which we are not able to control (pp. 163-5). 

In view of this uncertainty over the norms and the dispersion of 
group intelligence tests, the inexperienced tester would be well ad- 
vised never to calculate individual I.Q.s from them, but to treat their 
results in purely relative fashion, i.e. in the way which was recom- 
mended for educational tests (cf. p. 68-9). The class teacher can 
obtain from a group test practically all the information which it is 
capable of yielding if he or she simply finds which child in the class 
has the highest score (corresponding to the highest M.A.), which the 
next highest, and so on. When test results are thus confined to rank 
orders, they are far less likely to be misinterpreted than is common 
at present. 
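
A minimal sketch of this purely relative treatment is given below. The pupils' names and scores are invented for illustration, and the rule that tied scores share a rank is our own assumption rather than anything laid down in the text.

```python
def rank_order(scores: dict[str, int]) -> list[tuple[int, str, int]]:
    """Return (rank, pupil, raw score) from highest to lowest score.

    Pupils with equal scores share the rank of the first of them.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked: list[tuple[int, str, int]] = []
    for position, (pupil, score) in enumerate(ordered, start=1):
        if ranked and ranked[-1][2] == score:
            rank = ranked[-1][0]  # tie: reuse the previous pupil's rank
        else:
            rank = position
        ranked.append((rank, pupil, score))
    return ranked


# Hypothetical raw scores on a group test for one class.
class_scores = {"Ann": 54, "Bob": 61, "Cy": 54, "Di": 40}
for rank, pupil, score in rank_order(class_scores):
    print(rank, pupil, score)
```

Nothing here depends on the test's norms or on its dispersion of I.Q.s, which is why rank orders are so much harder to misread.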

21. Conclusion.—Mental tests have been widely, and often un-
fairly, criticized. But the progress of testing is probably hindered to
a greater extent by its friends, who are ignorant of its limitations,
than by its enemies. One of the main objects of this book is to
analyse these limitations and to explain the underlying principles
which must be understood if tests are to be applied and interpreted
properly. No one ought to use tests who is not willing to familiarize
himself with these matters, and with the literature which bears on the
particular tests that he employs. In skilled hands, testing provides a
far surer and more accurate tool for the assessment of abilities than
do subjective impressions or the ordinary examination paper. But
human nature is far too complex to be measured in the simple and
direct ways in which physical quantities can be measured, and the
over-enthusiastic but unskilled tester is only too likely to make
serious mistakes.


1 An exception may be made in the case of 11+ selection tests, which almost
invariably adopt a Standard Deviation of 15. Such a use of quotients provides
the simplest way of making age allowances, without which far more of the older
children in a year-group of candidates than of the younger ones would gain
places. At the same time, egregious statistical errors, leading to gross unfairness,
can be and often are committed when Education Authorities try to run such
examinations without the advice of a statistically trained psychologist.


CHAPTER XI 
EXAMINATIONS 


Functions of Examinations.—The chief functions of ordinary written 
examinations may be summarized as follows: 

1. They are employed as tests of achievement, to prove that an 
individual pupil has, or has not, acquired a certain amount of know- 
ledge of a subject. 

2. They are given to a whole class, or school, for a similar purpose, 
or for assessing the efficiency of the teachers whose pupils are 
examined. Although ‘payment by results’ has disappeared, schools 
are still commonly judged by the public on the basis of their ex- 
amination successes. And both head teachers and inspectors try to 
ensure the maintenance of certain standards by means of periodic
oral or written examinations.

3. Many of the most important examinations are assumed to
possess a prognostic function, i.e. to be capable of predicting future
achievements. Thus they are used to bar the gates between the
primary and the secondary school, between the secondary school
and the university, between the university and many professions;
and they are supposed to select the few who are best fitted for these 
higher stages. 

4. Many qualities besides achievement, or promise of achieve- 
ment, in particular scholastic subjects, are believed to manifest them- 
selves in examination success. The business employers who insist 
that their clerks shall have passed Matriculation or the General 
Certificate do not actually desire employees who are good at Latin, 
biology, and the like. When asked why they attach such importance 
to academic examinations, their answers show that they hope, rather, 
to acquire clerks of good general all-round ability in non-academic 
as well as in academic subjects. They say also that success at an 
examination requires a certain amount of perseverance and in- 
dustry, some docility to discipline, and calmness under pressure; 
and that these qualities of character are desirable also in their 
employees. The significance attached to a university degree is often 
almost as variegated. Employers are probably quite as much to 
blame for the stranglehold of the contemporary examination system
as are the teaching profession and the education authorities. 

5. Another function of examinations whose existence is seldom 
explicitly admitted is that they constitute an incentive for stimulating 
pupils and students to work, particularly in the secondary school 
and university; also for keeping teachers up to the mark. Indeed, 
some would maintain that the prospect of having to do examinations
must still be employed as one of the chief motives in education, even
if their results are entirely worthless. Clearly, then, examinations are
an essential feature of the whole educational system, which relies
largely upon extraneous motives, such as competition, fear of
punishment, and blame. And they can hardly be discarded until the
more progressive ideal of appealing to the pupils’ or students’
spontaneous interests becomes a practical reality. 

6. Many American writers (e.g. Kandel, 1936) believe that the 
only legitimate function of examinations should be that of guidance. 
By checking up the results of his instruction the teacher can see 
where he has failed to make some topic clear, and can improve his 
methods. By examining a pupil’s accomplishments he can both 
diagnose the pupil’s weak points which require special attention, and 
can direct his endeavours along the lines for which he is most suited. 
Secondary schools and universities should apply examinations to 
entering candidates, not so as to keep out all but a select few, but to 
discover the fields of work in which each candidate is likely to do 
best, and should dissuade from entering only those who could 
profit from no kind of advanced study. 

Unfortunately, the traditional written examination fails very 
badly in its attempts to fulfil these many different functions. We will 
consider first the criticisms which have been levelled against it, and
later ask whether other measuring instruments would serve the 
purpose better, or whether its adequacy can in any way be improved. 

Criticism of the Effects of Examinations upon Education.—The 
first and commonest complaint is that examinations dominate and 
distort the whole curriculum. Although only about 5% of the 
population reach the university, every grammar school pupil has to 
study for the General (or the Scottish Leaving) Certificate as though 
his ultimate objective was a university degree. The schools cannot 
afford to educate in the broader sense, nor to give the majority of 
pupils such knowledge as would be of most use to them in after 
life, because they must work towards examinations concerned 
almost exclusively with formal, academic subjects. Again, narrow
specialization is apt to occur in grammar and public schools, for the 
sake of university scholarships. 

Even more widespread are the bad effects of the 11+ secondary 
school entrance examination. Subjects or activities which do not 
contribute directly to passing the tests are generally neglected in the 
last year or two of primary schooling. In the larger schools, children 
are streamed even at the infant stage into classes thought likely to
pass or fail. Homework and coaching out of school hours are intro- 
duced at too early an age—more often, be it noted, at the behest of 
the parents than of the teachers. 

Furthermore, these examinations stimulate an unhealthy, com- 
petitive spirit in children, and at all ages they encourage the cram- 
ming of set books and rote memorization. Doubtless they act as an 
incentive (Function No. 5) but not an incentive to learning for its own 
sake. Indeed, examinations are an inefficient type of incentive, since
most pupils and students regard them as too remote to bother about 
until a few weeks or days beforehand. A few candidates, however, 
take them much too seriously, or are forced by their parents and 
teachers to over-work; with the result that many of the maladjusted 
children who come to Psychological Clinics for advice and treat- 
ment show signs of this examination strain. 

An additional excuse is often advanced for this state of affairs, one 
which is based on the old fallacy of formal discipline. It is said that 
although the passing of examinations in academic subjects is of no 
direct value to the vast majority of the population, yet the pupils’ 
minds are trained thereby so that they become better able to learn 
other subjects, to reason, to write good English, and so forth. A 
large body of experimental evidence not only proves that this spread 
of training in one field of study to other fields is usually extremely 
limited in extent, but also suggests that the very opposite may occur. 
The system may indeed train pupils and students in the best methods 
of passing examinations in other subjects, but it is more likely to 
stunt ‘reasoning power’ owing to its emphasis on mechanical 
memorization. Often it produces a dislike of everything associated 
with school, and so discourages the desire for further education. It 
does not help to develop writing of good English, since the English 
style in which most examination answers are written is known to be, 
on the average, markedly inferior to the style in which the same
pupils can write compositions at leisure.


INADEQUATE RELIABILITY OF EXAMINATIONS


We will not, however, dwell on these points since, for the purposes 
of this book, it is a more serious matter that most examinations are 
very defective methods of measuring the achievements or the future 
promise which they set out to measure. Their chief, though not their 
only, flaw is inadequate reliability. The following facts are well 
established:

(a) When examinees sit two different examinations dealing with
the same scholastic subject, both set and marked by the same
examiner, they are likely to obtain distinctly different marks on the
two occasions.

(b) When a single set of scripts is marked by two different
examiners, the resulting marks are liable to show wide discrepancies. 

The main causes of poor reliability are: 

(1) Actual changes in the examinees’ mental and physical states 
which affect their examination performances.

(2) Inadequate sampling of the examinees’ knowledge and ability
by the particular questions set. 

(3) Inconsistencies in the standards of marking adopted by 
different examiners, or by the same examiner on different occasions. 

(4) Differences of opinion between different examiners regarding 
the relative merit of the examinees’ answers. 

1. Changes in the Examinees.—This was described above (p. 128)
as ‘function fluctuation.’ Two closely parallel examinations sat on
consecutive days will always give somewhat divergent results, be-
cause some examinees will be feeling more fit, or in a better frame of
mind, at the first of the two, others at the second. Many external or
irrelevant happenings may temporarily depress or improve the performance
of a few individuals. Over long intervals the examinees’ abilities will
naturally show considerable developments and alterations. On the 
whole, however, the uncertainties occasioned by such factors are 
probably smaller and less important than those due to the other 
three causes. 

2. Inadequate Sampling.—This factor is responsible for the com-
mon complaint among examinees, namely, that the questions ‘did
not suit them,’ that on another set of questions they might do better
(or worse). A pupil’s capacities in the various special branches or
aspects of a subject are always somewhat uneven. And no examina-
tion is capable of bringing out everything that the pupil can do in
that subject, or of measuring all his strong and weak points. Hence,
the half-dozen or so questions which he answers are never more than 
a sample of the questions which would be needed to cover the whole 
field. Common sense, as well as statistical principles, tell us that 
the more lengthy the examination, and the greater the number of 
questions, the more representative will the sample be, and the better 
will the ground be covered. Thus in certain university honours 
examinations where the students’ marks are based on twenty-four 
hours of examination work, their marks would be unlikely to alter 
to any great extent if they sat another similar series of papers. But 
when an examination consists, say, only of one short paper in 
English and one in arithmetic, quite large alterations might easily 
occur in further parallel papers. The correlation between results on 
ordinary 1- or 2-hour papers in the same subject depends con- 
siderably on the heterogeneity, or range of ability, among the 
examinees (cf. p. 122); but a fairly typical figure is +0.70, and it is
often less than this. We can therefore predict, by applying the 
Spearman-Brown formula (p. 127), that examinees would require to 
take at least four such papers before the ground was sufficiently 
covered for the reliability coefficient to rise to the fairly respectable 
figure of +0.90.
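
The prediction of four papers can be checked with a short calculation. The sketch below assumes the usual form of the Spearman-Brown formula, in which a test lengthened k times has reliability k r / (1 + (k - 1) r); the function names are our own.

```python
import math


def spearman_brown(r_single: float, k: float) -> float:
    """Reliability of a test lengthened k times, given reliability r_single."""
    return k * r_single / (1 + (k - 1) * r_single)


def papers_needed(r_single: float, r_target: float) -> int:
    """Smallest whole number of parallel papers reaching r_target."""
    k = r_target * (1 - r_single) / (r_single * (1 - r_target))
    # the small tolerance guards against floating-point noise when k is whole
    return math.ceil(k - 1e-9)


r_one_paper = 0.70  # typical correlation between two parallel papers
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(r_one_paper, k), 3))
print("papers needed for 0.90:", papers_needed(r_one_paper, 0.90))
```

Three papers fall just short of the target and four just exceed it, which agrees with the figure quoted above.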

Suppose that one hundred pupils take such an examination, and 
that the passing mark is fixed so that half of them pass. Now if they 
take a second similar paper which correlates +0.69 with the first,
we can predict that only 37 of the pupils would pass on both, and 
37 fail on both, and that as many as 26 would pass the first and fail 
the second, or vice versa. This hypothetical but only too typical 
illustration demonstrates the unfairness which inevitably results 
from the lack of thoroughness of many examinations, and in par- 
ticular the difficulties which arise when the examination is supposed 
to divide the candidates into two distinct categories at an arbitrary 
pass mark. 
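
The figures of 37, 37 and 26 can be reproduced under one stated assumption: that the marks on the two papers follow a bivariate normal distribution, with the pass mark at the median of each. The proportion above the median on both papers is then 1/4 + arcsin(r)/(2 pi). The text does not say how its figures were obtained, so the sketch below is offered only as one route to them.

```python
import math


def pass_fail_split(r: float, n_pupils: int = 100) -> dict[str, float]:
    """Expected numbers passing both, failing both, or changing category.

    Assumes bivariate normal marks and a pass mark at the median of
    each paper, so that half the pupils pass each one.
    """
    both = n_pupils * (0.25 + math.asin(r) / (2 * math.pi))
    return {
        "pass_both": both,
        "fail_both": both,  # by symmetry about the median
        "changed": n_pupils - 2 * both,
    }


for label, value in pass_fail_split(0.69).items():
    print(label, round(value))
```

With a correlation of zero the same formula gives 25, 25 and 50, so even a correlation of +0.69 removes only about half of the purely chance disagreements.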

An additional reason for the variability of a pupil’s level in 
answering different questions or sets of questions is that the majority 
of examination answers consist of English compositions or essays. 
(This does not usually apply to papers in mathematics or in foreign 
languages; but it does to English, history, geography, science, and 
other subjects.) And the English essay is known to be a particularly 
variable production. The compositions written by a pupil weekly 
throughout a school year fluctuate widely in their merits. As Ballard 
(1923) points out, teachers may refuse to accept this statement, and
claim that their pupils are fairly consistent from week to week. But 
the fact that a teacher marks any one pupil’s work on a dead level 
probably indicates that, unwittingly, he has formed a stereotyped 
conception regarding his ability, and is marking the successive 
compositions not on their own merits but in the light of his know-
ledge of the pupil’s average achievement. The truth of this suggestion
is borne out by the many stories of the two pupils who regularly 
received, one a good, the other a bad, mark, and who continued to 
do so even when they exchanged their essays. 

An outside examiner does not possess this knowledge of the 
pupil’s average work, and so will give poor marks to those who 
happen to write answers much below their normal level, good marks 
to those who rise above it. Moreover, under the stress of examination 
conditions, such deviations are likely to be exaggerated. Probably 
much more representative products would be forthcoming if the 
examination could be answered at leisure, under conditions similar 
to those which apply when pupils do written work at home, or at 
school. The common practice in Civil Service, and certain other, 
examinations of using a single essay as an indication of the candi-
dates’ all-round literacy, cultural background, and ‘quality of mind’
is particularly deplorable, as pointed out by Vernon and Millican
(1954) in an investigation of students’ essays on several different
topics.

3. Divergent Scales of Marks.—This factor has already been dis-
cussed fully in Chaps. II and IV, hence we will not pause to re-argue
the illogicality and injustice of a system which permits a far higher
proportion of passes, or a higher average mark, to be awarded in
one university or General Certificate subject than in another. If a
mark of 80%, or a first class or credit, signifies to one examiner the
standard achieved by a quarter of the examinees, and to another the
standard reached by only 1% of the examinees, then it is obvious
that this mark means different things to different people. And it
follows that an examinee’s mark will vary with the particular ex-
aminer who happens to read his papers, unless all the marks given
by all the examiners are uniformly standardized in accordance with 
the principles outlined above. 
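
One simple way of putting different examiners' marks on a common scale is sketched below. It is an illustration only: the target mean of 50 and standard deviation of 10 are arbitrary choices of ours, the marks are invented, and the method (a linear rescaling of each examiner's marks) may differ in detail from the procedures of the earlier chapters.

```python
from statistics import mean, pstdev


def standardize(marks: list[float],
                target_mean: float = 50.0,
                target_sd: float = 10.0) -> list[float]:
    """Rescale one examiner's marks to a chosen mean and standard deviation."""
    m = mean(marks)
    sd = pstdev(marks)
    if sd == 0:
        raise ValueError("all marks are identical; nothing to rescale")
    return [target_mean + target_sd * (x - m) / sd for x in marks]


# Hypothetical marks for the same three scripts from two examiners.
lenient = [70.0, 80.0, 90.0]
severe = [40.0, 50.0, 60.0]

print([round(x, 1) for x in standardize(lenient)])
print([round(x, 1) for x in standardize(severe)])
```

The two examiners differ by 30 marks in their raw standards, yet after rescaling their marks coincide. Rescaling removes differences of standard only; it does nothing about disagreements over the relative merit of the scripts.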

The standards of marking even of a single examiner are liable to 
alter with fatigue and mood. When marking large numbers of answers
to the same question, he may begin very strictly and gradually
become more lenient, or vice versa. The same script might receive
a different mark if read after, instead of before, dinner. 

4. Differences in Opinion as to Relative Merit.—The most impor- 
tant source of unreliability (which is usually found in combination 
with No. 3) is the personal element which enters into every ex- 
aminer’s evaluation of a script. We will first summarize some of the 
evidence regarding this subjectivity. The classical experiments were 
carried out in America as far back as 1912. Starch and Elliott (1913) 
sent copies of a single geometry paper to the chief geometry teachers 
of 116 high schools, with the request that they should mark it in 
accordance with their usual practice. The marks awarded varied all 
the way from 28% to 92%. Two of the teachers gave it more than 
90%, 18 gave it between 80% and 90%, 18 gave it between 60% and
30%, and 2 under 30%. Similar results were obtained with papers in 
English and in history. Equally striking is the anecdote related by 
Wood (1921) of the papers which were marked in the usual way by 
six examiners. The first examiner, for his own guidance, wrote out a 
set of (what he regarded as) model answers to the questions; but he 
unfortunately left this among the candidates' papers. He was subse-
quently awarded marks varying from 40% to 90% by the other five
examiners! Many experiments, conducted under far more strictly 
controlled conditions than these, have substantially confirmed their 
findings. Particularly noteworthy are the elaborate studies carried 
out in England and France in the 1930s, under the auspices of the 
International Examinations Enquiry Committee. 

Hartog and Rhodes (1935) arranged for typical groups of scripts 
to be marked by experienced examiners under normal examination 
conditions. Special place, School Certificate, and university honours 
examinations were represented, and several different subjects were 
studied. Generalizing from a number of their results, it appears that 
when half a dozen examiners independently mark a set of scripts, the 
lowest and highest marks given to any one script differ on the average 
by about 10 to 12%. Among a few of the scripts the discrepancies 
may be quite small, but among others they may be far larger, often 
greater than 20%. The authors show that part of the divergence may 
be ascribed to the different standards of leniency adopted by different 
examiners (No. 3, above), but that even when this factor is elimin- 
ated, large variations remain. In several of the experiments the 
examiners drew up an agreed scheme of instructions for marking, 
listing both the general characteristics and the detailed points which 
were to be looked for in the scripts. In the marking of English com-
positions, where this was scarcely possible, the inconsistencies were
much greater.

Now Hartog and Rhodes doubtless desired to shock the public 
conscience with regard to examinations, and so did their best to 
emphasize the disagreements. The present writer is more impressed 
by the smallness than by the largeness of the discrepancies, and 
would conclude from his study of the results that important public 
examinations, like the School Certificate and university honours, are 
marked much more consistently than is usually the case. Instead of 
laying their chief emphasis on the extreme discrepancies, the authors 
might have pointed out that the median disagreement between any two
examiners is not more than 3% in the best-conducted examinations.
Since this figure represents the Probable Error of an examinee’s mark,
a few examinees will naturally show discrepancies four or five times
as large. It represents an average correlation between different
examiners of +0.80, when the standard deviation of their marks is
10%. And decidedly lower figures have been obtained by other in- 
vestigators. The authors also omit to point out that the reliability of 
an examinee's final total mark is generally very much higher than 
their results indicate, since it may be based on half a dozen or more 
papers, each marked by two or more examiners. Their results mostly 
refer to single three-hour, or two two-hour, papers. 
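
The link between a Probable Error of 3%, a correlation of +0.80 and a standard deviation of 10% can be checked as follows. The sketch assumes that the P.E. of a mark is 0.6745 times its standard error of measurement, i.e. 0.6745 x S.D. x sqrt(1 - r), and that errors are normally distributed. These are the usual conventions, but they are not spelled out at this point in the text.

```python
import math

PE_PER_SD = 0.6745  # half of a normal distribution lies within +/-0.6745 S.D.


def probable_error_of_mark(sd_marks: float, r_between_examiners: float) -> float:
    """Probable Error of a single mark: 0.6745 * S.D. * sqrt(1 - r)."""
    return PE_PER_SD * sd_marks * math.sqrt(1.0 - r_between_examiners)


def share_beyond(multiple_of_pe: float) -> float:
    """Proportion of normally distributed errors exceeding k Probable Errors."""
    z = multiple_of_pe * PE_PER_SD
    return math.erfc(z / math.sqrt(2.0))


pe = probable_error_of_mark(sd_marks=10.0, r_between_examiners=0.80)
print(f"P.E. of a mark: {pe:.2f}%")
print(f"share of errors beyond 4 P.E.: {share_beyond(4):.4f}")
```

On these assumptions the P.E. comes out close to 3%, and errors of four times that size are rare but not negligible when many candidates are examined.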

Nevertheless, this study deserves to be taken very seriously, since 
it indicates that many less thorough examinations are deplorably 
unreliable; that in the absence of a scheme of instructions drawn up
and applied by experienced examiners, much worse discrepancies
may arise; that when the average and the dispersion of marks are
not standardized, gross differences may appear in the proportions of
Credits, Passes, Fails, etc., which are awarded; and that even a P.E.
of 3% may frequently make all the difference between a Pass and a 
Fail, or a First and a Second Class. 

1 As most reviewers of An Examination of Examinations pointed out, some of
the experiments were rendered quite futile by the selection of unduly homogene-
ous sets of scripts for the examiners to mark.

Another example will be quoted which better illustrates the sub-
jectivity of marking under ordinary school conditions. Boyd (1924)
collected large numbers of essays on ‘A Day at the Seaside’ from
11-year-old Scottish school-children (i.e. children at the qualifying
stage). From these he selected 26 as representing the whole range of
merit, and had them marked independently by 271 teachers. The
marks were not numerical, but consisted of seven grades, ranging 
from Excellent to Unsatisfactory. On comparing the results of 
different markers, it was found that almost every essay received at 
least 6 of the 7 possible grades. The following instance is typical: 3 
teachers marked it Excellent; 31 VG +, 80 VG, 125 G +, 28 G, 
and 4 marked it Moderate (the lowest grade but one). Boyd shows 
that part of the discrepancy was due to differing standards. But even 
when the 20% most lenient markers and the 20% most severe were 
excluded, the average essay was still awarded at least 4 out of the 7 
grades. He also points out that the average grade given to an essay 
by a large number of markers is highly consistent, and that the 
various divergent individual grades are distributed around this 
average more or less in accordance with the normal distribution 
curve. Actually it has been shown by Wiseman (1949) that a reason- 
ably reliable mark for English composition can be obtained at the 
11+ selection stage (since the range of ability is very much wider
than it is among highly selected grammar school pupils or university 
students), provided that four examiners' marks are combined. It is 
desirable also that each candidate should write on two or more 
topics. The burden of marking is not so heavy as might appear, since 
only the essays of some 10 to 20% of borderline candidates need to 
be considered; and rapid general impression marking at the average
rate of one minute per script is adequate. 

Other experiments, including one of Hartog’s, indicate that when 
the same examiner re-marks a set of scripts after an interval, the 
discrepancies between his first and second opinions are likely to be 
almost as great as those between two independent examiners. Some 
of the studies by the French members of the International Examina- 
tions Enquiry are also worth quoting. A set of papers in science were 
marked by three trained scientists (one of them doing it twice over), 
and also by a student who had no scientific training. When the mark- 
ings were compared, the average correlation coefficient between the 
scientists was +0.63, whilst the correlations between the student and
the scientists averaged +0.51. The latter figure is lower than the
former, but so little lower that it casts some suspicion on the value of 
an examiner’s knowledge of his subject. 

Particularly striking was a study of the philosophy examination in 
the Baccalauréat. One hundred scripts were marked by six examiners. 
Only 10 of the candidates were passed by all the examiners, only 9 
failed by all of them. The remaining 71 were passed by some, failed
by others. It is, of course, always the examinees of medium ability 
whose true standing is most uncertain. The few best or few worst are 
recognized as such by all examiners. But only too frequently an 
arbitrary dividing line—the pass mark—has to be erected in the 
middle of this region of maximum uncertainty. Undoubtedly, there- 
fore, a large number of candidates who just succeed, or just fail to 
gain a General Certificate pass or a scholarship might very easily 
obtain the opposite result if they were marked by a different examiner, 
or if they sat an alternative set of papers. In many instances their 
whole educational or vocational career may be blighted owing to
the unreliability of their marks. 

1 This result suggests that the correlation between any pair of examiners
averaged only +0.60.

Reasons for the Subjectivity of Marks.—The prime reason for the
discrepancies which we have described is that different markers con-
ceive differently the desirable or undesirable characteristics of exam-
ination answers. In other words, they attempt to assess many diverse
types of abilities. One examiner may give chief credit for evidence of
work done, and for the comprehensiveness and accuracy of the facts
contained in the answers. Another may look rather for signs of
future promise, originality, and grasp of general principles. One may
insist on clarity of expression of ideas, another may try to assess the
profoundness of the ideas irrespective of their good or bad expres-
sion. Some may search for emotional rather than strictly intellectual
qualities, such as the examinee’s interest in the subject. Whenever a
candidate states his attitudes or opinions, these are liable to coincide,
or clash with, the attitudes or opinions of the examiner; and however
desirous the examiner may be of maintaining impartiality, he is liable
to bias if his pet theories are approved or attacked by the candidates.
Often, if he would take the trouble to formulate explicitly his notion
of the aim of the examination, he would find that he means by a
good examinee the one who closely approximates to himself, and as
a poor candidate the one who shows none of the intellectual or
emotional qualities which he himself idealizes. Is it to be wondered,
then, that different examiners differ in their opinions of a candidate?

We must recognize that any one script may manifest an extreme
complexity of features, good and bad, and so can be looked at from
a multitude of different points of view. Its merit can never be assessed
in the same way that a physical object is measured. An examiner
ordinarily splits it up into its chief components, notes the facts which
it contains or omits, its elements of originality, style of expression, 
the mechanics of its grammar or computations, and so on. He 
attaches a certain weight to each of these possible components (a 
weight which is likely to be highly individualistic, unless a detailed 
scheme of marking has been adopted), and finally re-synthesizes the 
candidate’s performance on each component into a single mark for 
the answer as a whole. Some examiners do not attempt a systematic 
analysis of this kind, and merely mark on the basis of a general 
impression. But as Ballard (1923) and Hamilton (1929) show, every 
general impression involves a more or less vague analysis.

Even in such subjects as arithmetic, or the translation of foreign 
languages, the weightings assigned to the various analysable features 
of an answer, and the penalties given to various possible mistakes, 
are liable to differ among different markers. But the complexity is far 
greater in the case of English composition, or in those subjects such 
as history, science, etc., where the answers consist chiefly of
compositions. The essay-type of answer has been very widely, and
justifiably, criticized. Examinees have to waste a large proportion of the
available time in translating their historical or scientific knowledge
into readable essay-form, and in the mere process of writing. This
latter is particularly unfair, because some can write much faster than
others without necessarily being better historians or scientists. A
considerable proportion of the written words (for instance, the a's,
the's, and's, etc.) add nothing to the evidence of the examinee's good
or poor ability. Investigation shows that the average examinee puts
into writing less than one fact or idea per minute of examination 
time, since the process of expressing these facts or ideas takes so 
long. The examiner, then, has the still more difficult task of
translating the product back again, of trying to penetrate through the
verbiage to the signs of ability and knowledge beneath. Few
examiners can remain entirely uninfluenced by the literary style or the
handwriting of an answer (there is direct experimental proof of this), 
although they are presumably trying to mark the product for its 
historical or scientific merit, not qua essay. 


INADEQUATE VALIDITY OF EXAMINATIONS 


The poor reliability of examinations greatly reduces their value for
any and every purpose. For if they cannot even measure examination
success without serious errors, they naturally cannot validly predict
anything else. Even, however, if perfect reliability were established, 
validity would still be diminished by the dependence of examination 
success on factors other than ability, knowledge, and industry. 

One such factor is, obviously, the efficiency of the examinee’s 
teachers as examination coaches, and their cleverness in guessing the 
kinds of questions which examiners will set. It is manifestly unfair 
that a pupil at one school, where the teachers are specially adept at 
instilling examination knowledge, should succeed when a pupil of 
equal ability fails because he attends another school, where the 
teachers are either less skilled, or more humane in the education 
which they provide. 

There is also the old complaint that some pupils are ‘good
examinees,’ while others become unduly ‘nervous,’ and ‘fail to do
themselves justice.’ One wonders whether it is not usually the weaker
candidates who make this complaint, and whether they would do 
better if subjected to any other kind of test, including the tests of real 
life. However, we may admit that certain temperamental qualities, 
which we can as yet hardly specify, together with good physical 
health, may confer some advantage on certain candidates, which 
other candidates lack.

Examinations, then, certainly fail to fulfil the first two functions 
with which we started, namely, the measurement of achievement, 
both because of poor reliability and because of the other considera- 
tions just mentioned. How imperfect they are we cannot determine. 
We can, however, study their validity as measures of future ability 
by directly correlating their marks with the results of the same pupils 
at later stages in their careers. Comprehensive investigations into 
this matter were carried out by Valentine and Emmett (1932), with 
almost uniformly disheartening results. By comparing the subsequent 
secondary school achievement of groups of pupils who had been 
awarded, or who had failed to obtain, scholarships at the time of the 
special place examination, it was found that many of the latter 
ultimately surpassed the former group. Similarly, at the university 
level, Valentine found that roughly two-fifths of all those awarded 
scholarships failed to justify their promise, since they eventually ob- 
tained only third-class honours or pass degrees. They were beaten 
by as many as one-third of the non-scholars, who achieved first- or 
second-class honours. Of those who won scholarships given by the 
universities, only about one-fifth failed to live up to expectations, of 
State scholars only one-tenth. But this meant that practically
one-half of those who obtained other types of scholarships (from their
schools or from local authorities) did not justify themselves (cf. 
Valentine, 1938). 

In a further study by the Scottish Council for Research in Educa- 
tion (1936), marks gained by secondary school pupils at the Leaving 
Certificate examination, together with teachers’ marks and the head 
masters’ estimates of the pupils’ general standing, were correlated 
with subsequent class marks and degree marks at the universities. It 
was found that on the whole those who got first-class honours had 
been rated higher by the school head masters than those who got 
second class, and that honours students had been rated higher than 
ordinary degree students; but that, as in Valentine's research, there
were a great many exceptions. The actual correlation coefficients
between school and university marks varied from about +0.30 to
+0.70, and the average of several such results was +0.48. In the
light of our discussion of correlations (Chap. VII), this figure
implies very poor predictive validity.
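
A rough illustration of why a coefficient of about +0.48 predicts so little may be given with two conventional statistics: the proportion of variance accounted for, and the "forecasting efficiency" derived from the coefficient of alienation. This is only a sketch; the function names are invented for the example, and these two measures are assumed here rather than taken from Chap. VII.

```python
from math import sqrt


def variance_explained(r):
    """Proportion of criterion variance accounted for by the predictor (r squared)."""
    return r * r


def forecasting_efficiency(r):
    """Reduction in the error of prediction relative to guessing the mean.

    Uses the coefficient of alienation, sqrt(1 - r^2); this choice of
    measure is an assumption of the example, not the author's own method.
    """
    return 1.0 - sqrt(1.0 - r * r)


# The lowest, average and highest school-university coefficients quoted above.
for r in (0.30, 0.48, 0.70):
    print(
        f"r = {r:.2f}: variance explained = {variance_explained(r):.2f}, "
        f"forecasting efficiency = {forecasting_efficiency(r):.2f}"
    )
```

On these measures a coefficient of 0.48 accounts for less than a quarter of the variance in university marks, which accords with the verdict of poor predictive validity.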

It should be remembered, however, that the above investigations 
(together with most of the studies of reliability) dealt with highly 
selected groups. At the 11+ selection stage practically the whole 
range of children is examined, and more reliable tests than the 
conventional subjectively marked papers are generally used nowa- 
days. The result is that this examination gives correlations around 
0.85 with performance in secondary school work 2 to 5 years later 
(Emmett and Wilmut, 1952). Moreover, in view of the imperfect 
reliability of the school marks used as a criterion of achievement, the 
true figure should be of the order of +0.90. Yet even this coefficient 
implies that many mistakes are made in selection. If 20 out of 100 
are admitted to grammar schools, 5—that is one-quarter—are likely 
to turn out unsuitable, and 5 who have been rejected and sent to 
modern schools would have surpassed them. 
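
The two steps in this argument, the correction of the observed coefficient for the unreliability of the criterion and the estimate of misplaced pupils, can be sketched numerically. The sketch rests on assumptions not stated in the text: that selection marks and later achievement follow a bivariate normal distribution, and that a "suitable" pupil is one who falls in the top 20 per cent on later achievement. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

STANDARD = NormalDist()  # standard normal distribution (mean 0, s.d. 1)


def correct_for_attenuation(observed_r, criterion_reliability):
    """Estimate the correlation with a perfectly reliable criterion."""
    return observed_r / sqrt(criterion_reliability)


def joint_upper_tail(rho, cut, steps=4000, upper=10.0):
    """P(X > cut and Y > cut) for a standard bivariate normal with correlation rho.

    Integrates pdf(x) * P(Y > cut | X = x) over x > cut by the midpoint rule.
    """
    width = (upper - cut) / steps
    scale = sqrt(1.0 - rho * rho)
    total = 0.0
    for i in range(steps):
        x = cut + (i + 0.5) * width
        passes_criterion = 1.0 - STANDARD.cdf((cut - rho * x) / scale)
        total += STANDARD.pdf(x) * passes_criterion * width
    return total


def misplaced_per_hundred(rho, admitted_fraction):
    """Pupils per 100 who are admitted but fall outside the top group later on."""
    cut = STANDARD.inv_cdf(1.0 - admitted_fraction)
    both = joint_upper_tail(rho, cut)
    return 100.0 * (admitted_fraction - both)


# Criterion reliability that would raise an observed 0.85 to 0.90 (an inferred
# figure, not one given in the text).
implied_reliability = (0.85 / 0.90) ** 2
print(round(correct_for_attenuation(0.85, implied_reliability), 2))
print(round(misplaced_per_hundred(0.90, 0.20), 1))
```

Under these assumptions the number of admitted pupils who prove unsuitable comes out close to the 5 in 100 quoted above, and by symmetry the same number of rejected pupils would have surpassed them.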

It would be unreasonable to expect perfect correlations between 
marks at successive stages in pupils’ careers, since the subjects 
studied at a secondary school differ distinctly from those studied at 
a primary or preparatory school, even when called by the same 
names. And the conditions of work at a university differ from those 
at a school, being more congenial to some, less stimulating to 
others. Again, every individual changes to some extent as he grows 
older, some of his abilities becoming relatively weaker, new ones
developing. All in all, then, the inability of entrance and scholarship
examinations to choose those candidates who will profit most from 
higher education is not very surprising to the psychologist. Yet their 
validity for this purpose is certainly much poorer than is generally 
supposed by the educational authorities who set the examinations, 
and who base their awards and selections upon them. The universi- 
ties would scarcely continue to dictate the present constitution of the 
General and Leaving Certificates, and Matriculation examinations,
if they did not believe that these instruments picked out the most
suitable entrants.


Still more unjustified is the belief embodied in the fourth of the 
functions outlined above (p. 196), namely that these examinations 
are adequate criteria of vocational fitness. It is indeed true, as em- 
ployers assert, that successful candidates will on the average be 
better in general all-round ability than those who have failed to gain 
the Certificates, since all scholastic abilities are to some extent 
positively correlated with intelligence. It is probably true, also, that 
vocationally desirable temperamental and character qualities may 
aid in examination success. But so do a great many other factors. 
Some pupils with extremely undesirable characters may do well at 
examinations through intellectual brilliance; others through efficient
coaching on the part of their teachers, or through over-cramming;
still others through the pure chance factors introduced by the 
subjectivity of marking and other forms of unreliability. The con- 
clusion follows, then, that success or failure may depend on so 
many diverse qualities, that it does not provide trustworthy evidence 
of any of them. The employer would be far better advised to make 
use of the services of a scientific vocational psychologist, who would 
measure or obtain the best available assessments, both of intelligence 
and other aptitudes, of educational achievements and of tempera- 
mental traits, and would give each of these qualities its due weight 
in deciding on the suitability of prospective employees. Undoubtedly 
the adoption of a similar scientific study of each individual candidate 
for secondary or university education would lead to many more just 
selections, and more accurate predictions of future achievement, than 
do the present methods (cf. Dale, 1954). 

Although this chapter has consisted mainly of destructive criticism, 
certain hints have been given as to possible ways and means of
improving examinations. These will be followed up, and expanded
in the next chapter and in Chap. XIV.


CHAPTER XII 


RELATIVE ADVANTAGES OF NEW-TYPE AND 
ESSAY-TYPE EXAMINATIONS 


It has been widely claimed that the worst defects of the traditional 
type of written examination may be overcome by adopting in its 
place the objective or new-type test for measuring achievement. 
Such tests came into common use in the United States many years 
ago, but are still little known in this country, in spite of Ballard's 
(1923) able advocacy. Americans, however, realize now that they 
also are liable to certain defects when used undiscriminatingly, and 
that there are still functions which can better be performed by the 
old- or essay-type of examination. Thus we are in a position to profit 
both from their researches and from their errors, to devise our tests 
in accordance with the principles they have already worked out, and 
to apply them to the fields of educational measurement where they 
will be most useful and most valid. 

Now it is clear that some educational products can be marked 
much more objectively and reliably than others. In the early stages 
of school arithmetic, when children are given a series of short sums 
to do, there is no subjective element involved in deciding which ones, 
and how many, they do correctly. But before long arithmetic
becomes more complex; each sum involves a number of different
operations, partial credits are essential, and different markers are
now liable to differ. An English composition, or an examination
answer written in essay form, brings in a still greater personal
element, since it is still more complex and cannot ever be considered
merely from the point of view of right or wrong. The new-type
tester argues, therefore, let us refrain from setting sums or essay
questions which involve many operations, and instead ask questions
which evoke each operation separately, questions to which only one
right answer is possible. And instead of leaving to the marker's
personal taste the analysis which must be made of a complex product,
and the weighting which must be attached to its component
elements, let the person who sets the questions make the analysis
beforehand, and decide systematically just how many marks (i.e.
how many questions) shall be allotted to each element.

Advantages of New-type Examinations.—So we arrive at the
new-type examination, which contains a large number of brief questions
instead of a few long ones. Every question is so arranged that all 
markers will agree as to the rightness or wrongness of an answer. 
This elimination of the personal element is not the only advantage. 
For the hundred or more questions of the new-type test are capable 
of sampling the whole field of knowledge much more
comprehensively than can the half-dozen or so questions of the orthodox
examination. Candidates can no longer complain that the particular 
questions failed to suit them, since all their strong and weak points 
are probed. This fact should help to reduce the ‘spotting’ and cram- 
ming of likely questions. Again, the subjectivity of standards of 
marking need no longer cause trouble, since the examinees’ marks 
are pure ‘count scores’ (cf. p. 20). So long as the questions are de- 
signed to cover all levels of difficulty evenly, the distribution of marks 
is sure to approximate to normality, and can readily be converted 
into any desired percentage or other scale. The examiner may, of 
course, draw an arbitrary pass line at any point in the distribution 
that he wishes, so as to cut off the examinees who fail to come up to 
his minimum standards. 

A full description of the various sorts of questions which may be 
employed in new-type tests is given in the next chapter. Here it will 
be sufficient to distinguish between the so-called ‘open’ or ‘recall’ 
type of question and the ‘closed’ or ‘recognition’ type. In the former 
the desired answer consists of a single word, or a few words, or 
numbers, which the examinee must write either in a space provided 
on the examination paper itself, or on a separate sheet. Questions 
1-16 on pp. 213-14 are of this type. Since, however, there is always 
some danger of partially right or alternative answers being given, 
whose marking would involve subjective judgment, the other type of 
question is more commonly adopted, where the examinee has to 
identify the right answer from among two or more answers printed 
on the examination paper. Either he may underline or put a check 
beside the answer he believes to be correct, or else merely write 
down its letter or number. Questions 17-54 on pp. 215-19 are of this 
type. 

This latter system has the additional advantage that practically all 
writing is eliminated. So that when geography or history is being 
examined, irrelevant factors such as speed of writing or literary style 
do not distort the examiner's marking. It has been claimed that the 
average examinee, instead of expressing less than one idea or fact 
per minute (as in an essay-paper), can deal with three to six new-type 
questions a minute. But this naturally depends on the level of diffi- 


Such questions a better insight into them. Nos. 1-8 are simple-recall
type, Nos. 9-16 completion type, Nos. 17-25 true-false type, Nos.
26-36 multiple-choice type, Nos. 37-47 matching type, and Nos.
48-54 rearrangement type. The correct answers are listed on
p. 258.


Please read the following directions carefully before starting:


letter, word, or number. 


You may guess if you are not certain of an answer; but as marks
will be deducted for wrong answers, random guessing is not
advisable.


1. Eight pupils obtained the following school marks: 


[Table of the eight pupils' marks: not recoverable from the source.]

What is their median?

2. If the above marks were put into rank order form, what would
B's rank position be?

3-5. The following is an extract from the norms for a group
intelligence test:

Score          48   58   68   77   84   89   93
Mental Age      9   10   11   12   13   14   15 and adult

6-8.

Percentile     Mark
[illegible]    73%
75             67%
[illegible]    [illegible]
25             53%
10             [illegible]

What is his approximate place (i.e. his rank position in the
class) in:

6. English?

7. French?

write anything in the actual spaces, but opposite No. 9 in the
right-hand margin write the word or letter which is needed to
fill up space (9). Fill in the other words or letters similarly.

The compensation theory of mental organization is disproved by the
fact that all correlations between mental abilities are (9). The results
of correlational investigations also disprove the view that the mind is
made up of distinct (10). The Two-factor Theory, proposed by (11),
regards each ability as made up of a factor common to all abilities,
called (12), and a factor (13) to that ability alone, called s. However,
when verbal tests, performance tests, and other tests of intelligence
are compared, it is doubtful whether (14) will account for all the
inter-correlations. Residual correlations indicate the presence of (15)
factors, such as verbal ability, v, and practical ability (16).

 9. ........
10. ........
11. ........
12. ........
13. ........
14. ........
15. ........
16. ........


17-25. Write a + sign in the margin after any of the following
statements which are true, and a - sign after those which are
false.

17. An omnibus test is a group intelligence test
in which items of many kinds are mixed up.

18. A majority of very dull children are below
average in health and physical development.

19. Performance tests are applied by psychologists
chiefly in order to measure a child's
practical ability in everyday life.

20. In order to measure the average intelligence
of a class of 7-year-old children, a teacher
would be best advised to use a group test
with pictorial material.

21. In a normal frequency distribution curve the
semi-interquartile range and the standard
deviation are identical.

22. University students who obtain scholarships
on the basis of competitive examinations
as often as not obtain no better degree
marks than those who fail to win such
scholarships.

23. The advantage of scoring abilities in terms
of Mental Age or Educational Age is that
the units of measurement are all equal to
one another.

24. A group of a hundred children picked out as
being especially superior in one school
subject will practically never show as great
an amount of average superiority in any
other subject.

25. The Probable Error of a pupil's English
composition mark is generally likely to be
smaller than that of his arithmetic mark.


26-36. Write the letter (A, B, or C, etc.) of the correct answer in the
margin.

26. Group intelligence tests, as contrasted with
the Stanford-Binet test, are:

(A) More varied in content 

(B) Better measures of children's intelligence 

(C) More objectively scored 

(D) More difficult to apply 

(E) od suitable for application to superior 
adults 


27-32. A set of 51 pupils took three examinations, A,
B, and C. The following distributions of marks
were obtained:

Mark       A     B     C
87+        1     1     3
80+        3     2     5
73+        4     4     7
66+       18     7    11
59+       17    11    11
52+        5    14     8
45+        2     9     4
38+        1     3     2

Total     51    51    51


27. Which of the distributions approximates
fairly closely to normal, A, B, C, or
None?

28. Which of the distributions is negatively
skewed, A, B, C, or None?

29. Which of the distributions is likely to
have the smallest standard deviation?

30. A certain pupil happened to get 70% on
all three examinations. In which of
them did he do relatively best?

31. If you were calculating the average of
Distribution C, what figure would you
take as your arbitrary mean?

32. In calculating this same average, what
figure would be multiplied by 2 (i.e.
by the number of pupils who score
38+)?


33. A group intelligence test was applied in a
secondary school, and the occupational
preferences of all the pupils were
ascertained and classified under ten headings.
In order to find whether there is any
correspondence between occupational
preference and intelligence, which of the
following statistical methods could best be
applied?

(A) Biserial r. 

(B) Product moment correlation. 
(C) Analysis of variance. 

(D) Contingency coefficient. 

(E) Correlation ratio. 

(F) Multiple correlation. 

34. Which is the least serious of the following
defects in intelligence quotients obtained
from group intelligence tests?

(A) Group test scores are considerably
affected by previous practice on similar tests.

(B) The standard deviation of group test
I.Q.s differs considerably in different
tests.

(C) Few of the published tests are adequately 
standardized, or have reliable norms. 

(D) Group tests are generally given with a 
time limit, and the child who is quick- 
est is not necessarily the most intelli- 
gent. 


35. Which one of the following statements is
untrue? The distribution of marks given by
an examiner to a class of fifty pupils may
be very irregular or skewed because:

(A) The number of cases is too small for a 
smooth, normal distribution to be 
expected. 

(B) The examiner's opinion of the relative 
merits of the scripts is a personal one. 

(C) The pupils' abilities may be asymmetric- 
ally distributed, as when, for example, 
the class contains a long tail. 

(D) The scale of marks which the examiner 
adopts may be lacking in any rational 
basis. 


36. In an experiment on the scores of graduate
and non-graduate teachers on verbal and
non-verbal group intelligence tests, these
results were obtained:

                          Verbal Tests    Non-verbal Tests
Graduates                    153.9             78.6
Non-graduates                143.5             77.0
P.E. of the difference
between means                  2.1              1.2

Which of the following conclusions may
legitimately be drawn?
(A) Graduates do better than non-graduates
at intelligence tests.
(B) All teachers do better at verbal than at
non-verbal tests.
(C) Scores on verbal tests are improved by 
taking a university degree. 
(D) Teachers with university training have 
no appreciable advantage over others 
on non-verbal tests. 
[Figure: four pairs of frequency curves, (A) to (D), each plotting
frequency (F) against I.Q.]

37-41. The above pairs of curves (A-D) represent the approximate
frequency distributions of the intelligence quotients of various
groups of children. Below are listed several such groups.
Opposite four of them write the letter of the pair of curves
which corresponds. Write ‘None’ opposite the groups to which
no pair of curves corresponds.


37. One thousand elementary school pupils, and
one thousand special school pupils.
38. All elementary school pupils and all special
school pupils in the country.
39. All elementary school pupils and all children
in the country.
40. All special school and all secondary school
pupils in the country.
41. All elementary school and all secondary
school pupils in the country.

42-47. Below is given a list of psychologists (A-F), together with a
list of important pieces of work in educational psychology.
After each of the latter write the letter of the psychologist who
was mainly responsible for, or whom you chiefly associate with
this piece of work. Two letters may be written if two
psychologists were prominently associated. Some letters may be used
twice or more, some not at all.


(A) Ballard
(B) Binet
(C) Burt
(D) Drever-Collins
(E) Terman
(F) Thurstone

42. Development of the conception of the Intelligence Quotient.
43. Development of a scale of performance tests.
44. Investigations by correlations into the factors underlying abilities.
45. Advocacy of new-type examinations.
46. Construction of educational attainments tests.
47. Re-standardization of the Binet-Simon scale.


48-54. An arithmetic test, a handwriting test, and two group 
intelligence tests were given to an ordinary school class, and 
the following correlations between test scores were worked out: 


(A) Combined intelligence tests with handwriting test. 
(B) Combined intelligence tests with arithmetic test. 
(C) Combined intelligence tests with chronological age. 
(D) One intelligence test with the other. 

(E) Arithmetic test with handwriting test. 
(F) Handwriting test with chronological age. 
(G) One intelligence test with arithmetic test. 
Which pair would you expect to yield: 

48. The highest correlation?

49. The second highest?

50. The third?

51. The fourth?

52. The fifth?

53. The sixth?

54. The lowest correlation? 


DEFECTS OF NEW-TYPE EXAMINATIONS 


So far only the advantages of new-type examinations have been 
mentioned, and we must now consider criticisms of them in detail.
Two minor objections will be admitted at once. First, the cost of
printing or duplicating a new-type test is considerably greater than
that of an essay-type examination. For classroom purposes, however,
oral presentation of the questions can often be adopted. If several
groups of pupils or students are to be examined, they can write their
answers on blank sheets, and their test papers can be used several
times over.
Secondly, much greater time and skill are needed for constructing 
a good new-type test. The major criticisms which are to be discussed 
below apply only too often to inferior tests which have been rapidly 
constructed by inexperienced amateurs. It is highly desirable that 
teachers should be trained in the principles of testing during their
training college courses. The time needed in setting a paper is,
however, more than offset by the time saved in marking, when the
number of examinees is large. Indeed, since the marking should be
foolproof, there is no reason why it should not be done by clerks or
secretaries. In large public examinations, the authorities could save
much expense by paying professional examiners only to set, not to
correct, the papers. The present writer finds that the construction of
a one-hour new-type test in educational psychology (such as that
appended above) takes him about twenty hours. Setting an essay-type
examination scarcely needs one hour. But in marking these
one-hour examinations he can easily do thirty of the former as against
ten or less of the latter scripts per hour. He, therefore, finds no saving
in total hours of work unless the number of examinees approximates
to 300 or more. But the setting of a new-type test is a fascinating
occupation, which can be done in odd moments throughout the
year; and the marking is simply a routine matter which involves no
mental strain. By contrast, the marking of large numbers of essay-type 
scripts in psychology is the most trying work that he ever has to do. 
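
The break-even point mentioned above follows from simple arithmetic, sketched below with the figures quoted in the text (twenty hours against one hour for setting, thirty against ten scripts marked per hour). The function names are illustrative, and the essay rate is taken at the upper figure of ten scripts per hour.

```python
def total_hours(setting_hours, scripts_per_hour, examinees):
    """Setting time plus marking time for one examination."""
    return setting_hours + examinees / scripts_per_hour


def break_even_examinees(new_setting=20.0, essay_setting=1.0,
                         new_rate=30.0, essay_rate=10.0):
    """Number of examinees at which the two kinds of paper cost equal time."""
    extra_setting = new_setting - essay_setting
    saving_per_script = 1.0 / essay_rate - 1.0 / new_rate
    return extra_setting / saving_per_script


print(round(break_even_examinees()))
for n in (100, 300, 600):
    print(n, round(total_hours(20, 30, n), 1), round(total_hours(1, 10, n), 1))
```

With a slower rate of essay marking ("ten or less" per hour) the break-even number falls below this figure.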
The Guessing Factor.—Next we must admit that examinees may 
obtain a certain number of correct answers (in a recognition-type 
test) by pure chance guessing. If they filled in answers at random 
they would, on the average, achieve half-marks on the questions 
where two responses are provided (such as the true-false type), and 
quarter-marks on the questions where four responses are provided. 
This fact disturbs the teacher far more than it does the psychologist. 
For the latter realizes that every examinee has the same chance of 
scoring, let us say, 35%, by random guessing; he therefore regards 
35% as the effective zero point and only gives credit to scores above 
this level. An alternative procedure, which comes to much the same 
thing, but is fairer in that it does not penalize the examinee who omits 
questions rather than guess them, is to subtract the total wrong 
responses in true-false questions from the total right ones; to subtract
half the total wrong in three-response questions, one-third in
four-response questions, and so on. That is to say, an examinee's
score corrected for guessing is

$$\text{corrected score} = R - \frac{W}{n-1},$$

where R is the total right, W the total wrong, and n is the number of
alternative responses provided in each question. In general practice the
correction is omitted in four- or more response questions, since it is so small.*
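
The correction just described (the wrong answers divided by one less than the number of alternatives, and subtracted from the right answers) can be written as a short function. This is a minimal sketch with an invented function name; omitted questions count neither as right nor as wrong.

```python
from fractions import Fraction


def corrected_score(right, wrong, alternatives):
    """Score corrected for guessing: R - W / (n - 1).

    right        -- number of questions answered correctly (R)
    wrong        -- number answered incorrectly (W); omissions are excluded
    alternatives -- number of responses offered per question (n)
    """
    if alternatives < 2:
        raise ValueError("a recognition item needs at least two alternatives")
    return Fraction(right) - Fraction(wrong, alternatives - 1)


# Ten true-false items: 6 right and 4 wrong.
print(corrected_score(6, 4, 2))
# The same 6 right with the other 4 omitted rather than guessed.
print(corrected_score(6, 0, 2))
# Twenty-six four-response items: 20 right and 6 wrong.
print(corrected_score(20, 6, 4))
```

The second call shows why the procedure does not penalize the examinee who omits questions instead of guessing them.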

But the problem of guessing is a complex one which cannot be 
entirely settled by this mechanical correction. It has aroused wide- 
spread controversy, and a long discussion of it may be found in Ruch 
(1929). Although the above correction will compensate for the 
effects of guessing in the average examinee, yet it will not do so in 
all cases, since some will be luckier than others, in accordance with 
the principles of probability. Suppose that a large group of persons 
guesses the responses to ten true-false items; the average person will 
get 5 of them right, and two-thirds of the group will get either 4, 5, 
or 6 right.* Only about one person in a thousand will be so lucky as 
to get 10 right, or so unlucky as to hit on none of the right answers. 
If the guessing correction is applied, the scores corresponding to 10, 
6, 5, 4, and 0 right will be +10, +2, 0, —2, and —10 respectively. 
In a longer test the range of chance scores which may arise from 
guessing will be still greater, although their average (when corrected) 
will still be zero. Undoubtedly, then, we must expect true-false tests 
to be somewhat unreliable, but the effect will be less marked in tests 
composed of three- or four-response items. Experimental evidence 
does indeed show true-false tests to be less reliable than multiple- 
choice tests; and the latter are less reliable than tests made up of 
recall items. But the difference between them is not large, and it may 
easily be offset by the fact that a true-false test can contain more items 
than a multiple-choice test which occupies the same length of time, 
since true-false items can be answered more quickly. Probably a 
good deal depends, also, on the skill of the tester in setting each 
particular type of question. 
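
The chance figures quoted in this paragraph follow from the binomial distribution, and may be checked with the sketch below. It assumes, as the paragraph does, that every item is answered by pure random guessing; the function names are illustrative.

```python
from fractions import Fraction
from math import comb, sqrt


def chance_distribution(items, alternatives=2):
    """Probability of each number right (0..items) under pure guessing."""
    p = Fraction(1, alternatives)
    return [comb(items, k) * p**k * (1 - p) ** (items - k)
            for k in range(items + 1)]


def corrected_score_sd(items, alternatives):
    """Standard deviation of the corrected score R - W/(n - 1) under guessing.

    With every item attempted, the variance works out to items / (n - 1).
    """
    return sqrt(items / (alternatives - 1))


dist = chance_distribution(10)
print(dist[10])        # chance of all ten right
print(sum(dist[4:7]))  # chance of 4, 5 or 6 right
print(corrected_score_sd(10, 2), corrected_score_sd(10, 4))
```

The smaller spread of corrected scores for four-response items illustrates why the effect of guessing is less marked in multiple-choice than in true-false tests of the same length.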

* Note that there is no point in making any guessing correction when all
examinees attempt an answer to every question. Their relative marks will be
affected by the correction only when they omit some questions. In general, then,
correction becomes less necessary the longer the time allowance for the
examination.

* The situation is effectively the same as getting the persons each to toss a
coin ten times, and counting up how many of them obtain 0, 1, 2, . . ., 9, 10
heads.

Now, though we must allow that a considerable error may occur
among the scores of a small proportion of examinees on account of
guessing, in actual practice entirely random guessing (which was
assumed in the previous paragraph) will very seldom occur. The 
examinee is more likely to possess an incomplete knowledge of 
certain items, sufficient, however, to make his guesses right ones. 
Teachers may regard it as unfair that such an examinee will obtain 
the same marks on those items as will the more able examinee who
really knows the right answers. But a brief study of recognition-type
questions, such as those given above, will show that the examinee's
shaky knowledge may not always be to his advantage. For the
alternative (wrong) answers are usually worded sufficiently plausibly
to deceive the weak student, and should indeed be based on common
misconceptions. Hence his ‘semi-guesses’ will in practice be very
frequently wrong. For this reason experimental researches indicate
that the guessing correction fully compensates, or even
over-compensates, for any advantages which the weak student might gain
by guessing. Other investigations prove that the amount of guessing 
may be greatly reduced, and the reliability and validity of the test 
slightly increased, by instructing examinees not to guess, and by 
warning them that wrong guesses will be penalized. 

To sum up: random guessing may considerably distort the marks 
on a recognition-type test, especially when the questions provide 
only two alternative answers. It does not actually do so to any great 
extent, partly because guessing scarcely ever is random, partly be- 
cause it can be reduced by appropriate instructions, and partly because 
such tests can include large numbers of items which make for increased 
reliability. If the recognition-type of item is to be condemned, it must 
be primarily on account of some of the other reasons discussed below.
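The arithmetic behind this summary can be sketched in a few lines of Python. The sketch assumes purely random guessing, and it assumes the conventional correction of rights minus wrongs divided by one less than the number of alternatives. The function names are illustrative only and are not taken from the text.

```python
from math import comb, sqrt


def chance_distribution(n_items: int, n_alternatives: int) -> list[float]:
    """Probability of 0, 1, ..., n_items right answers by pure guessing."""
    p = 1 / n_alternatives
    return [comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
            for k in range(n_items + 1)]


def corrected_score(right: int, wrong: int, n_alternatives: int) -> float:
    """Assumed conventional correction: rights minus wrongs / (alternatives - 1)."""
    return right - wrong / (n_alternatives - 1)


def chance_spread_percent(n_items: int, n_alternatives: int) -> float:
    """Standard deviation, in percentage marks, of a pure guesser's raw score."""
    p = 1 / n_alternatives
    return 100 * sqrt(p * (1 - p) / n_items)


if __name__ == "__main__":
    # Longer tests and more alternatives both shrink the spread due to chance.
    for n_items in (10, 50, 200):
        print(n_items, round(chance_spread_percent(n_items, 2), 1))
```

On these assumptions a pure guesser's mark on ten two-choice items varies with a standard deviation of about 16 percentage points, whereas on two hundred such items the figure falls to about 3.5, which is the sense in which large numbers of items make for increased reliability.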

Suggestive Effects of Wrong Responses.—Another common 
criticism which we can refute is that the presentation of false
statements and wrong answers to the suggestible minds of pupils is
pedagogically unsound. It is alleged that they may absorb such
statements and later on believe them to be correct. This theory is an
a priori one, and experiments which have been devised to try it out
have always failed to provide confirmation. Probably the fact that 
examinees know that about half the true-false statements, and more 
of the multiple-choice responses, are wrong is sufficient to offset the 
effects of suggestion. Many teachers find that there is considerable 
pedagogical value in getting pupils and students to correct their own 
papers shortly after the test, so letting them see which responses are 
wrong (cf. Ballard, 1923).

The Elementaristic Tendency in New-type Tests.—We come now to
the most obvious and most frequent objection to new-type tests. It
is said that they measure only the most trivial aspects of ability, such 
as details of information. The tests may indeed analyse complex 
mental operations into elements which can be objectively marked, 
but they cannot put the elements together again, and in the process 
of analysis they omit many of the most important aspects of ability. 
They cannot, as can the essay-type of examination, show the ex- 
aminee's general understanding of the subject, nor his interpretation 
of facts, his capacity for organizing and formulating his knowledge, 
nor his initiative and originality. Far from reducing the tendency of 
examinees to cram mechanically, they may increase the amount of 
rote memorization and discourage any desire to achieve a general 
grasp of the field. Investigation has, indeed, shown that students revise 
their work in a different manner when they know they are to sit a 
new-type examination, and that they concentrate on acquiring such 
bits of knowledge as are most likely to appear in the test questions. 
It is often alleged, though without proof, that secondary pupils who 
have been trained in the primary schools to pass objective selection 
tests find considerable difficulties in adjusting to more normal 
forms of school learning. 

Such criticisms embody a number of half-truths, and may be
countered by a number of arguments. The retort may first be made
that even if the new-type test cannot measure ‘understanding,’
‘originality,’ etc., the essay examination cannot do so either. For
when examiners come to assess such qualities, they disagree widely
with one another. The orthodox examination achieves a reasonable
degree of reliability when it most closely approximates to a new-type
test, i.e. when it asks primarily for factual data, or when the
examiners draw up a detailed analytic scheme of the points which they
wish to mark. Secondly, it may be pointed out that examinees
engaged on an essay examination do not usually shine in regard to
organization of knowledge and original thinking. Rather they dash
down all the ideas and facts which they can recall, many of them
irrelevant. ‘Thinking’ as contrasted with ‘reproduction of items of
information’ is at a minimum. The new-type test taps all the relevant
knowledge, with much less waste of time, and prevents the candidates
from producing irrelevant material.

Such recriminations are also, however, only partially justified.
Much sounder is the argument which points to weaknesses in the
critics’ psychological analysis of the contrasted examinations. They
commonly draw a sharp distinction between information and
reproduction, on the one hand, and understanding and thinking, on the
other hand. They assume, that is, that these mental processes are
uncorrelated. In actual practice the two cannot be separated so
readily. The acquisition and reproduction of information always
involves a certain amount of thinking, and no thinking is possible
unless a person possesses information to think about. Hence,
experiments in which attempts have been made to measure these
faculties separately have usually shown them to be highly
inter-correlated. It follows, therefore, that even when a new-type test is
apparently measuring nothing but information, it is at the same time
providing a pretty good measure of other more complex types of
ability. It cannot ask examinees to ‘Discuss . . .,’ ‘Compare . . .,’ etc.,
nor make any overt provision for independent thinking, yet it is
certainly drawing to a considerable extent upon these ‘higher’ mental
processes which the orthodox examiner desires to assess. Conversely
a large proportion of the average essay-type answer consists of the
sort of material which critics affect to despise when it is accurately
measured by new-type tests.

The overlapping between the two kinds of mental processes is
further demonstrated by the fact that a new-type response, or a
sentence in an essay answer, may depend upon ‘understanding’ in one
examinee, but may result from ‘facile reproduction’ in another
examinee, according to the respective methods by which these
examinees have studied the topic in question. Nevertheless, it seems
legitimate to make a partial distinction between the two types, and
to admit that examinees have more opportunity of displaying ‘higher’
mental processes in essay-type than in most new-type examinations.
And we must remember that, even if they do not usually write good
English in their essay papers, yet they would probably never learn
to write English at all if new-type examinations became the sole
method of testing their abilities. This does indeed tend to occur
among primary pupils in education areas where the selection
examinations are purely objective, and some authorities have
reintroduced the composition examination in order to counter it.

The reason why the criticisms of triviality and fragmentariness
seem at first sight so justifiable and so serious is that many
new-type tests are incompetently constructed. Questions which ask for
bits of information are much the easiest to set, and, therefore, do
only too frequently constitute the bulk of home-made examinations.
But there is nothing inherent in new-type examining which makes 
better questions unattainable. And the writer would claim that at 
least half the questions in his own test, above, do demand not only 
the possession of information, but also the application of principles 
or the interpretation of information; that is to say, ‘understanding.’ 

To conclude, then, the defect which we have been discussing is one 
which may occur in new-type tests, but need not do so to any great 
extent. Its effect upon the validity of the tests has been greatly ex- 
aggerated by critics owing to their faulty views of the nature and 
organization of mental abilities. 

Defects Due to the New-type Medium of Expression.—Our next 
objection is of a different order. It does not try to point out some- 
thing which new-type tests fail to measure, but rather to show that 
they measure a good deal over and above the abilities which we 
actually want to assess; and that this extra factor, having little 
educational significance, seriously distorts and reduces the validity 
of all new-type test results (cf. pp. 144, 168). Some of the most ardent 
advocates of new-type testing, e.g. Ballard (1923), do not seem to 
have taken this criticism into account. An excellent account of it may 
be found in Hawkes and Lindquist et al (1937). 

If the reader will analyse carefully the processes which take place 
in his own mind when answering the new-type questions above, he is 
likely to find a good deal going on besides the recollection and 
application of his knowledge of the subject. He must first study the 
instructions to discover what he has to do in each question, and then 
manipulate his knowledge in such a way as to fit it in with the require- 
ments of the question. In a multiple-choice question he may arrive 
at the right response, not because he is certain that he knows it, but 
because he has eliminated the incorrect ones. Sometimes a single 
word in the question or in the provided response will enable him to 
choose correctly, without his having to think out the full implications 
of all that is printed. Some questions may, unknown to the author, 
contain clues to the correct responses which do not require any 
knowledge whatsoever of the subject. But if the reader is a sufficiently 
sophisticated new-type examinee, he will at once take advantage of
such faults in construction. Examples of these irrelevant clues are
given in the next chapter. Unfortunately, the questions which are
most liable to contain such errors are those that aim to evoke
‘understanding’ rather than ‘reproduction.’ Straightforward questions
which ask primarily for factual information (e.g. Nos. 27-33
and 42-47) are generally free from them. But more elaborate
questions (e.g. Nos. 34-41, 48-54) involve a great deal of ‘disentangling’
and complex inference. The ability to do such questions may depend
rather largely on the examinee's previous experience and practice
with such tests, or upon some unknown group factor, and not
solely on his knowledge of the subject being tested.

Such sources of unreliability and invalidity may be much reduced 
if the examiner is sufficiently skilled in anticipating all the diverse 
mental processes which examinees may employ, but they are likely 
to be eliminated only if he falls back on questions of the simple- 
recall type, and asks merely for scraps of information. Actually no 
examiner can foresee all the possible weaknesses in his questions, for 
investigation shows that even tests constructed by experts may
contain items whose validity is very low, or sometimes negative. That is,
these items are done no better, and sometimes worse, by good than 
by poor students. A difficult multiple-choice item, for example, may 
contain wrong alternatives which are sufficiently plausible to deceive 
the majority of good students, but which are beyond the compre- 
hension of the poor students, who consequently just guess, and often 
guess rightly. Or the correct answer may contain some unsuspected 
ambiguity which is not noticed by mediocre students, who therefore 
choose it, but which is noticed by good students, who therefore reject 
it. Such faults in construction are still more likely to occur in tests 
devised by amateurs. 
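One way of detecting items of low or negative validity after a paper has been sat is to compare how the best and the poorest examinees fared on each item. The following Python sketch uses a simple upper-group minus lower-group index; this particular index, the default of thirds, and the function name are assumptions introduced for illustration and are not a procedure laid down in the text.

```python
def discrimination_index(item_correct: list[bool],
                         total_scores: list[float],
                         groups: int = 3) -> float:
    """Proportion passing the item in the top group minus that in the bottom group.

    Examinees are ranked by total score and split into `groups` equal parts;
    only the highest and lowest parts are compared (assumed convention).
    """
    if len(item_correct) != len(total_scores):
        raise ValueError("one item result is needed per examinee")
    n = len(total_scores)
    size = max(1, n // groups)
    ranked = sorted(range(n), key=lambda i: total_scores[i])
    bottom, top = ranked[:size], ranked[-size:]
    p_top = sum(item_correct[i] for i in top) / size
    p_bottom = sum(item_correct[i] for i in bottom) / size
    return p_top - p_bottom


if __name__ == "__main__":
    totals = [10, 20, 30, 40, 50, 60]  # invented total marks
    # An item passed only by the weakest examinees has negative validity.
    print(discrimination_index([True, True, False, False, False, False], totals))
```

An index near zero corresponds to an item done no better by good than by poor students, and a negative index to one done worse by the good students.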

It was shown in the previous chapter that many of the defects of 
the orthodox examination derived from the fact that examinees have 
to formulate or express their knowledge and ability in a distorting 
medium, namely essays, and that examiners disagree in their inter- 
pretation and evaluation of these complex mental products. We see 
now that much the same is true of new-type examinations. Examinees 
have to express their knowledge and ability in a different but perhaps 
still more artificial medium. This ‘roundaboutness,’ and the
complexity of the mental processes involved in arriving at their answers,
undoubtedly reduce the validity of their marks. In examining average 
and dull pupils or adults it is not correct to claim that the new-type 
examination is unaffected by their poor literacy. For though writing 
is largely eliminated, the examinee has far more reading to do. 

The Subjectivity of New-type Tests.—A final defect in new-type 
tests, which has been entirely neglected by the majority of writers, is
that they are, in a sense, just as subjective as essay-type examinations.
Only the personal element enters, not in the marking, but in the 
setting of the questions. Pullias (1937) has demonstrated that when 
two or more examiners construct their own new-type test papers, 
intending to cover precisely the same field of knowledge and ability, 
and apply them to the same pupils or students, the average
correlation between the two sets of marks is only of the order of +0.50.
This figure may be unduly low because the examiners were teachers 
who may not have been highly skilled in test construction. Yet some 
thirty-five different teachers took part in the experiment, all of whom 
habitually employed new-type tests, and the results were very similar 
in different school subjects, and among pupils of various ages. Pullias 
further found that standardized educational tests, which were alleged 
to measure ability in the same school subjects, only yielded an 
average inter-correlation of +0.68. In some instances the
coefficients were decidedly higher (+0.8 to +0.9), in others very much
lower (+0.2 to +0.3).
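For readers who wish to repeat such a comparison with their own papers, the coefficient in question is the ordinary product-moment correlation between the two sets of marks. A minimal Python sketch follows; the marks in the example are invented and the function name is merely illustrative.

```python
from math import sqrt


def product_moment_r(marks_a: list[float], marks_b: list[float]) -> float:
    """Product-moment correlation between two examiners' marks for the same pupils."""
    if len(marks_a) != len(marks_b) or len(marks_a) < 2:
        raise ValueError("two equally long lists of at least two marks are needed")
    n = len(marks_a)
    mean_a, mean_b = sum(marks_a) / n, sum(marks_b) / n
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(marks_a, marks_b))
    spread_a = sum((a - mean_a) ** 2 for a in marks_a)
    spread_b = sum((b - mean_b) ** 2 for b in marks_b)
    return cross / sqrt(spread_a * spread_b)


if __name__ == "__main__":
    # Invented marks for six pupils on two teachers' papers in one subject.
    paper_1 = [42, 55, 61, 38, 70, 49]
    paper_2 = [50, 47, 66, 45, 58, 60]
    print(round(product_moment_r(paper_1, paper_2), 2))
```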

How may we account for such results? Actually the sources of un- 
reliability described on pp. 200-7 apply here, too. These were: 

1. Actual changes in the examinees, or function fluctuation. 

2. Inadequate sampling. As already mentioned (p. 212), this 
should be much less prominent in new- than in essay-type examina- 
tions, especially if the instructions given at the beginning of the next 
chapter are followed. The trouble is not so much lack of thorough- 
ness of the questions as the subjectivity of their selection. 

3. Statistical defects. These can arise, not because different exam- 
iners adopt different scales of marks, but because the average level, 
and the range, of difficulty among their questions may differ. Such 
inconsistencies can, however, readily be eliminated, and they did not 
affect Pullias’s correlations. 

4. Subjectivity. The essay-examination is subjective because the 
examiner has to analyse the field of ability and decide which aspects 
are most important, and his personal opinion determines which 
characteristics of the examinees shall be assessed as good or bad. The 
new-type examination is subjective because the examiner has to
perform a similar process of analysis in choosing questions, and in
providing the good and bad answers between which the examinees are
to discriminate. Again, the essay-examiner is unable to maintain
objectivity because he must translate the answers presented to him,
or ‘penetrate through the medium of expression’ in order to discover
the examinees’ real merit. The new-type examiner carries out an
equivalent translation when formulating his questions. Hence differ- 
ent examiners will formulate questions about a given topic differ- 
ently, in just the same way that different essay-examiners will trans- 
late differently the answers to an essay-question on this topic. 

Pullias points out that the main stages in examining—(a)
constructing questions, (b) answering questions, (c) evaluating answers—
in reality constitute a single whole. The new-type examiner has,
indeed, eliminated subjectivity from (c), but at the cost of greater
subjectivity in (a). And his personal opinion has inserted itself into (b),
since his questions embody a good deal of what, in essay-examinations,
is normally carried out by the examinees.

Now that we recognize this subjectivity as the chief flaw in new-type
tests, we should be able to control and reduce it more readily
than in essay-examining. Nevertheless, we must certainly conclude
that both types of examination are imperfect mental measuring
instruments, and open to much the same objections. Conclusive
evidence as to their relative validity is not easy to obtain, since usually
it is only possible to compare their marks with the marks on
subsequent, equally fallible, examinations. But by pooling the results
from several papers of both types, from class teachers' estimates, etc.,
a fairly good criterion may be achieved. And according to this,
neither the essay-type nor the new-type is definitely superior (cf.
Lee and Symonds, 1933). The evidence of the superior validity
of new-type tests which Ballard (1923) puts forward is quite
unconvincing. In secondary school selection conventional, subjectively
marked examinations are certainly inferior to new-type tests. But
when the marking is standardized there is little to choose; sometimes
the old type, sometimes the new, show better correlations with later
secondary school achievement. Certainly there is no proof that the
application of new-type tests in university entrance or scholarship
examinations would give appreciably better or poorer predictions of
ultimate university success than at present. Since, however, the errors
which reduce the validities of the two are different, it follows that
they measure somewhat different aspects of ability. Hence by
combining them the respective errors partially cancel one another out,
and the total validity is definitely superior to that of either of the
separate types. And as certain features of a scholastic subject may be
more readily tested by the one, other features by the other, each
should be employed in the spheres to which it is best adapted.
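The gain from combining the two types can be illustrated numerically. The sketch below, in Python, uses the standard formula for the correlation between a criterion and the sum of two equally weighted, standardized marks; this formula is not given in the text, and the figures are invented purely for illustration.

```python
from math import sqrt


def combined_validity(r_essay: float, r_new_type: float, r_between: float) -> float:
    """Validity of the equally weighted sum of two standardized marks.

    r_essay, r_new_type: correlation of each examination with the criterion.
    r_between: correlation between the two examinations themselves.
    """
    return (r_essay + r_new_type) / sqrt(2 + 2 * r_between)


if __name__ == "__main__":
    # Invented figures: each paper correlates 0.5 with the criterion,
    # and the two papers correlate 0.5 with one another.
    print(round(combined_validity(0.5, 0.5, 0.5), 3))
```

With these invented figures the combined validity comes to about 0.58, against 0.50 for either paper alone. The gain vanishes only if the two papers correlate perfectly with one another, that is, if their errors are identical and cannot cancel.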


CHAPTER XIII 


CONSTRUCTION OF NEW-TYPE EXAMINATIONS 


THE first stage in the construction of a new-type examination should 
be the preparation of a detailed statement of what the examination 
is intended to measure. A list of the topics which it is to cover should 
be written out, and a rough assessment made of their relative
importance, for example on a 1 to 5 scale. Suppose that the sum of all these
assessments comes to 75, and that the examination is to last two 
hours. Probably about 150 questions will be needed. In that case 
each topic assessed as 1 in importance should eventually be covered 
by two questions, each assessed as 5 by ten questions, and so on. 
Naturally this distribution of questions need not be enforced mech- 
anically; but the principle of allotting most questions to the most 
significant parts of the field is important. If the test is designed simply 
to show the acquaintance of students with a particular textbook, the 
questions dealing with each topic may be proportioned to the number 
of pages on the topic in the textbook. 
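The proportional allotment just described amounts to a single division, and may be sketched in Python as follows. The topic names and the helper name are hypothetical; the ratings are chosen only so that they sum to 75, as in the example above.

```python
def allot_questions(importance: dict[str, int], total_questions: int) -> dict[str, int]:
    """Share out questions among topics in proportion to their assessed importance.

    Rounding may leave the grand total a question or two away from the target,
    which is in keeping with the advice not to enforce the scheme mechanically.
    """
    total_importance = sum(importance.values())
    return {topic: round(total_questions * rating / total_importance)
            for topic, rating in importance.items()}


if __name__ == "__main__":
    # Hypothetical syllabus: ten topics rated 5, five rated 4, five rated 1 (sum 75).
    ratings = {f"topic_{i}": 5 for i in range(1, 11)}
    ratings.update({f"topic_{i}": 4 for i in range(11, 16)})
    ratings.update({f"topic_{i}": 1 for i in range(16, 21)})
    plan = allot_questions(ratings, 150)
    print(plan["topic_1"], plan["topic_11"], plan["topic_16"])
```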

We have already stressed the point that questions should not deal 
with trivial details merely because these are the easiest to set. It is
very unwise also for teachers to take sentences straight out of text- 
books and turn them into true-false or completion items. If this is 
done, the examinees will assuredly begin to study their textbooks 
with an eye to such sentences, and will try to learn them off by rote 
rather than to understand them. Textbook sentences may occasion- 
ally be adapted as wrong answers in multiple-choice questions, since 
they may then trap the parrot-learner. 

When the teacher and the examiner are one and the same person, 
suitable questions, or ideas for questions, are likely to occur to him 
while he is conducting the work of the class. These should be noted 
down and filed; they will save him much time and effort when the 
examination has to be made up. No examiner should expect to be
able to devise a complete test at one sitting, unless it is intended to
last only about a quarter of an hour. He will soon discover for
himself his speed of work, and will set aside sufficient hours, on a number
of different days, to enable him to carry out the construction
efficiently. The time needed will, of course, be much reduced if he
can take over some of the questions from previous papers which he has 
set in the same subject. This is often possible since the papers are 
usually collected at the end of the examination, and not allowed to 
circulate freely round the school or university. Before he enters upon 
the final stage of construction, he should have in hand roughly double 
the number of questions he is likely to need. Many will have to be 
rejected for one reason or another, hence he requires plenty of spare 
ones. 

To reduce the subjective element in setting, it is desirable for 
several examiners to co-operate in supplying questions. In an
institution or department where examinations are frequently set to large
numbers, it is useful to build up a pool of questions from which a 
number of new-type papers can be drawn as the need arises. Each 
question is recorded on a card in a classified file, with details as to 
the examinations in which it has been used, and any data regarding 
its difficulty and validity. Periodically, questions that have been used 
too often are withdrawn, and fresh ones added to keep up with any 
alterations in the course of instruction.
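A present-day equivalent of such a card file can be sketched in Python as below. The field names, the class name and the withdrawal threshold are all assumptions made for the sake of illustration; the text specifies only that each card records the examinations in which the question has been used, together with any data on its difficulty and validity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class QuestionCard:
    """One card in the classified file (field names are hypothetical)."""
    text: str
    topic: str
    used_in: list[str] = field(default_factory=list)  # examinations that used it
    percent_correct: Optional[float] = None            # observed difficulty, if known
    validity: Optional[float] = None                   # observed validity, if known


def withdraw_overused(pool: list[QuestionCard], max_uses: int) -> list[QuestionCard]:
    """Keep only the cards that have not been used more than max_uses times."""
    return [card for card in pool if len(card.used_in) <= max_uses]


if __name__ == "__main__":
    pool = [
        QuestionCard("Question A", "topic 1", used_in=["1937", "1938", "1939"]),
        QuestionCard("Question B", "topic 2", used_in=["1938"], percent_correct=62.0),
    ]
    print([card.text for card in withdraw_overused(pool, max_uses=2)])
```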

The choice of questions for any one paper is governed by a
different conception of difficulty from that which prevails in
traditional examinations. In accordance with the principles given in
Chap. II, the test should be so arranged that the poorest examinees
will obtain marks very little above 0%, the best ones very little
below 100%, and so that the average will be as near as possible to
50%. (These figures may then later be converted into some more
conventional scale.) A few questions, therefore, must be so easy
that almost all (say 95%) can manage them, a few so difficult
that scarcely any (say 5%) can manage them. Moreover, all
intermediate grades of difficulty must be adequately represented. By
contrast, all the questions in an ordinary examination are usually
arranged to be roughly equivalent in difficulty. It follows that
in a new-type examination, several sections of the work may be
omitted altogether if they are likely to be known so well that more 
than 95% of the students would answer questions on them correctly. 
It is a pure waste of time to include items which everyone can do. 
Conversely, sections of the work may have to be omitted if even the 
best 5% are unlikely to be able to answer questions dealing with 
them; such questions would also be useless. 

Thus the examiner's next step must be to grade each of his tentative
questions for difficulty, i.e. to assess what proportions of examinees
are likely to get each one right. The grading may be quite coarse, e.g.
A = 5 to 20%, B = 21 to 40%, C = 41 to 60%, D = 61 to 80%,
E = 81 to 95%. These estimates are likely to be highly inaccurate at
first, and to diverge widely from similar estimates made by other
examiners. But if he takes the trouble to check his predictions after
each paper, and notes how many examinees did actually pass each
item, his capacity will soon improve. The final version of the paper
should contain approximately equal numbers of A, B, C, D, and E
items.¹
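The checking of predicted against observed difficulty lends itself to a simple routine. The Python sketch below applies the five bands quoted above; the function names and the sample figures are hypothetical, and percentages are rounded to whole numbers before grading, which is an added assumption.

```python
from typing import Optional

# Bands of percentage passing, as quoted in the text.
GRADE_BANDS = (("A", 5, 20), ("B", 21, 40), ("C", 41, 60), ("D", 61, 80), ("E", 81, 95))


def difficulty_grade(percent_correct: float) -> Optional[str]:
    """Grade an item by the percentage passing it; None if outside 5 to 95%."""
    pct = round(percent_correct)
    for grade, low, high in GRADE_BANDS:
        if low <= pct <= high:
            return grade
    return None  # so easy or so hard that the item is of little use


def misjudged_items(predicted: dict[str, str],
                    n_correct: dict[str, int],
                    n_examinees: int) -> dict[str, tuple[str, Optional[str]]]:
    """Items whose observed grade differs from the examiner's prediction."""
    wrong = {}
    for item, guess in predicted.items():
        actual = difficulty_grade(100 * n_correct[item] / n_examinees)
        if actual != guess:
            wrong[item] = (guess, actual)
    return wrong


if __name__ == "__main__":
    # Invented results for sixty examinees.
    print(misjudged_items({"q1": "A", "q2": "B"}, {"q1": 6, "q2": 57}, 60))
```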

What particular form of new-type question to adopt depends
entirely upon the subject-matter. Some things are more readily
expressed in one form, some in another. No certain evidence of the
superiority or inferiority of any type is as yet available. Experiments
on this matter have been carried out; but the skill of the authors of
the items has not been controlled. Suppose, for example, that
multiple-choice items are found to be more reliable and valid than
true-false ones, this may merely show that the experimenter is himself
better at compiling multiple-choice items. Nor can we definitely
claim that different types measure distinctive abilities. The
correlations between two multiple-choice tests, or two true-false tests, do
not appear to be higher than those between a multiple-choice and a
true-false test. This indicates that they both measure the same thing.

There may, however, still be some advantage in probing the mind 
by a variety of techniques. And several different types may be em- 
ployed in a long paper so as to reduce monotony. In a short paper 
this is undesirable, since each type needs to be prefaced by full
instructions, and examinees may have to spend too large a proportion
of time in reading the instructions, and in readjusting their ‘mental
set.’ The following rule may be taken as a rough, though not a rigid,
guide. If matching items are included, there should be at least five of 
them, each including 5 to 10 responses. If multiple-choice items are 
included, there should be at least ten of them. If true-false, simple 
recall, or completion items are included, there should be at least 
twenty of each. The numbers of other types of questions should be 
comparable to these figures. 
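The rule of thumb may be checked mechanically once a draft paper has been assembled. In the Python sketch below the minimum figures are those just given, while the type labels and the function name are hypothetical.

```python
# Minimum numbers of items per type, following the rough guide in the text.
MINIMUM_ITEMS = {
    "matching": 5,
    "multiple_choice": 10,
    "true_false": 20,
    "simple_recall": 20,
    "completion": 20,
}


def types_below_minimum(counts: dict[str, int]) -> list[str]:
    """Question types that are present in the draft but in too small numbers."""
    return sorted(kind for kind, n in counts.items()
                  if 0 < n < MINIMUM_ITEMS.get(kind, 0))


if __name__ == "__main__":
    draft = {"matching": 3, "multiple_choice": 12, "true_false": 15, "completion": 0}
    print(types_below_minimum(draft))
```

A type that is absent altogether is not reported, since the rule applies only to types which the examiner has chosen to include.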

¹ If, however, the object of the examination is to select or cut off, say, the best
20%, or the poorest 30%, then maximum efficiency will be obtained by aiming
as many questions as possible at these respective levels of difficulty, and not
bothering about their suitability for spreading out the marks of the majority
(cf. p. 31).

When the questions have all been chosen they should be rearranged
so that all those of any one type are grouped together. In each 
group the questions should be arranged in estimated order of 
difficulty, starting with the easiest. If this precaution is omitted, the 
poor and medium examinees may waste a lot of time attempting 
difficult items at the start, and fail to reach easier ones which they 
could do, and the reliability of their results will be reduced. 
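This rearrangement is a two-level sort, first by type of question and then by estimated difficulty, easiest first. A brief Python sketch follows; the record layout and names are hypothetical, and difficulty is expressed as the estimated percentage of examinees expected to pass the item.

```python
from typing import NamedTuple


class DraftItem(NamedTuple):
    label: str
    kind: str               # e.g. "simple_recall", "true_false"
    percent_correct: float  # estimated percentage passing (higher means easier)


def arrange_paper(items: list[DraftItem], kind_order: list[str]) -> list[DraftItem]:
    """Group items by type in the given order, easiest first within each type."""
    return sorted(items,
                  key=lambda item: (kind_order.index(item.kind), -item.percent_correct))


if __name__ == "__main__":
    draft = [
        DraftItem("a", "true_false", 30),
        DraftItem("b", "simple_recall", 90),
        DraftItem("c", "true_false", 85),
        DraftItem("d", "simple_recall", 40),
    ]
    ordered = arrange_paper(draft, ["simple_recall", "true_false"])
    print([item.label for item in ordered])
```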

We will now consider each of the main types. 


SIMPLE RECALL AND OPEN-COMPLETION TYPE

A question or short problem is followed by a blank space where
the answer is to be written. Alternatively certain words or short
phrases are omitted from a sentence or paragraph, leaving blank
spaces to be filled in. For example:¹


1A. Who was the inventor of the steam engine? ........
1B. The inventor of the steam engine was ........

2. If y = 2x³ [remaining terms illegible], what is [illegible]? ........

3. [illegible]

4. A positive electric charge is carried by the ........,
a negative charge by the ........ A balanced action takes place
whereby XY ⇌ ........ + ........ The ........ theory provides an
explanation of the ........ of electricity by salt solutions.

5. A geography test may consist of a map on which numbers,
but no names, are printed. Below is a column of these numbers and
the examinee is instructed to write opposite each the town or river
which it represents. The same technique can be used for testing
knowledge of the names of parts of some instrument, mechanism or
biological specimen.


Each correct answer receives one mark; thus 6 marks may be
scored on Question 4.

The advantages of this type of item are, first, its naturalness and
its similarity to ordinary questioning; secondly, its freedom from the
chance guessing effects which occur in closed or recognition items;
and thirdly, the facility with which it may be constructed. This
facility, however, is somewhat delusive, and unsuspected flaws are
very likely to creep in. The chief disadvantage of the type is that its
scope is practically limited to rote knowledge, or skill at simple
arithmetical and scientific problems.

¹ The author feels that an apology may be needed for the puerility of some of
the illustrative questions in this chapter. They are intended to show the
application of the technique to various school and university subjects, at varying
stages of difficulty. Preferring to construct fresh ones rather than to quote from
published tests, he is limited by the rustiness of his own education.

A number of hints on construction may be listed as follows. 

(a) Try to ask only for important items of knowledge. 

(b) Ensure as far as possible that there is only one right answer, 
and therefore confine the answers to single words or numbers, or to 
very short phrases. Some of the blanks in Question 4 violate this 
rule. The last one might be filled by ‘conduction,’ ‘carriage,’ or
‘carrying.’ If there is any likelihood of alternatives, the examiner 
should list them all beforehand, and decide which ones to allow as 
correct. In this instance he would have to decide whether to disallow 
'passage.' Examiners may find it easier to conform to both these rules 
if they think of the answer first, and then build up the question or 
sentence around it, in such a way that the answer is completely 
delimited. 

(c) Keep the questions and sentences as short and as unambiguous 
as possible, and in particular avoid over-mutilation of a completion 
sentence by inserting too many blanks. A completion sentence should 
always be looked on as a form of question. See therefore that sufficient
words are left to make it an intelligible question. For the same 
reason, do not put blanks near the beginning of a sentence, which 
can only be answered by consulting later clauses of the sentence, or 
later sentences in the paragraph. Most of the blanks should be at the 
end of the sentences. 

(d) Study the whole content of a sentence or question to see that 
all of it is essential for the production of a correct response. In 
badly constructed items, it may be possible to deduce the response 
simply by the exercise of intelligence, without any knowledge of the 
subject-matter. In others the correct response may be given away
somewhere in a later sentence; or hints may be derived from the
grammatical construction. For example, blank spaces should not be
preceded by ‘a’ or ‘an.’ An irrelevant clue is provided by the ‘an’ in
the following example: 


6. A piece of land entirely surrounded by water is called an ........


This difficulty could be overcome by wording the sentence as a
question, or by printing ‘a(n) ........’

(e) Do not vary the length of the blank space according to the
size of the answer expected, as this would also supply irrelevant
clues. See that the spaces are big enough for examinees with large
writing.

(f) For purposes of easy marking, it is desirable to arrange for 
answers to be written in the right-hand margin. It is much more 
difficult for the eye to pick out answers scattered all over the page.
Even the following is rather inconvenient:


Write down the author of each of these plays.
7. The Admirable Crichton ........
8. Candida ........


One solution is to number the blank spaces and to print a
column of numbers opposite which the answers are written:


Write down the authors of each of these plays.
7. The Admirable Crichton                    7. ........
8. Candida                                   8. ........


A paragraph-completion test, with scattered answers, may be marked
more efficiently by constructing a stencil of cardboard, with holes
through which only the answers are visible when the stencil is laid
over the test paper. Under each hole on the stencil is written the
correct answer.
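Hint (b) above, that the examiner should list beforehand every alternative he is prepared to allow, translates directly into a marking key. The Python sketch below is one possible arrangement: the helper names are hypothetical, the accepted words for the last blank of Question 4 are those mentioned in the text, and the decision to disallow ‘passage’ is shown only as an example of an examiner's choice.

```python
def normalise(answer: str) -> str:
    """Ignore case and stray spacing when comparing a written answer with the key."""
    return " ".join(answer.lower().split())


def mark_completion(responses: dict[int, str], key: dict[int, set[str]]) -> int:
    """One mark for each blank whose answer is among the alternatives allowed."""
    return sum(1 for blank, accepted in key.items()
               if normalise(responses.get(blank, "")) in accepted)


if __name__ == "__main__":
    # Key for the last blank of Question 4 only; 'passage' is deliberately left out.
    key = {6: {"conduction", "carriage", "carrying"}}
    print(mark_completion({6: " Conduction "}, key))
    print(mark_completion({6: "passage"}, key))
```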


TRUE-FALSE TYPE

This generally consists of a set of statements, approximately half
of which are true, the rest false. The examinee has to indicate which
is which. In a spelling test the ‘statements’ may be reduced to single
words, whose correctness or incorrectness is to be checked. For
example:


Write a + after each of the following words which is correctly
spelled, and a — after each one wrongly spelled.

9. Medecine ........
10. Embarrassment ........
11. Accelerate ........
12. Sieze ........

A modification which is useful for tests of spelling, grammar, and
literary knowledge is to instruct the examinee to cross out, or
underline, the one word in each of a set of sentences which is
ungrammatical, or wrongly spelled, or wrongly quoted. One example of
each is given.¹


14. I met him and she in the town.

15. The ruffian demolished all the aparatus in the laboratory.

16. Rule, Britannia ! Britannia rules the waves.


¹ These should perhaps be classed as multiple-choice items, since the examinee
has to pick out the incorrect word from about half a dozen correct ones.


The ordinary true-false type has been commonly used in tests of 
all subjects, because it is capable of sampling very quickly a wide 
range of knowledge, and because examiners find it easy to construct. 
But the uncertainties due to guessing (cf. pp. 220-2) and, still more, 
its liability to unsuspected errors, have recently made it less popular. 
Some recommend that it should only be used when the multiple- 
choice type is ruled out for lack of plausible alternatives. Thus the 
following are justified since each involves only two possible answers: 

17. Distance east or west of Greenwich is
    known as longitude.                          True   False
18. When summer time starts the clocks
    are put back one hour.                       True   False


There is no need to confine true-false items to rote information. 
Indeed, questions of general principle, of interpretation and the 
like can well be expressed in this form. Great care is needed, how- 
ever, to see that they do not violate some of the following principles 
of construction. 

(a) Statements should be simply worded, otherwise they are 
liable to contain sections which are true and sections which are false. 
In general it is best to use sentences which contain only a single 
clause, embodying only a single idea; to word them positively; and 
to refrain both from negatives and from conjunctions such as 
‘because’ and ‘therefore.’ The following examples are ambiguous, 
and therefore unsatisfactory. 


19. Napoleon lost the battle of Waterloo
    because his army had insufficient
    ammunition.                                  True   False
20. Beethoven was a great composer who
    wrote many symphonies, operas, and
    sonatas.                                     True   False


The most crucial part of a statement, which renders it true or false,
should usually come near the end.

(b) It is not easy to construct difficult statements, nor to estimate
their degree of difficulty beforehand. They need to be sufficiently
plausible to deceive the examinee whose knowledge is incomplete,
and therefore not too obviously true or false. And the true ones
should be just as likely to appear false as the false ones true. At the
same time they must be clearly true or false to the able examinee, and
therefore must not contain ‘tricks.’ Good students look out for
essentials, and so may be tripped up if errors are introduced into
minor details. The following example is bad for this reason.


21. Water boils at 212° and freezes at 32° C. True False 


(c) It has been found that certain forms of wording are much
more apt to be used in true than in false statements, or vice versa.
Examinees soon get to recognize these, and so may answer correctly
without any knowledge at all of the subject-matter. The great majority
of statements, in amateur new-type tests, which contain the words
‘All,’ ‘Always,’ ‘Never,’ ‘Impossible,’ are false. The majority of those
which contain ‘Usually,’ ‘Often,’ and the like, are true. Most very
long statements also tend to be true more often than false. There is
no need to discard the above words altogether, but they should be
used with caution and, if possible, introduced as frequently into
true as into false statements.

(d) If the number of statements is less than, say, twenty, there
should not be an exactly equal division between true and false, else
examinees will count up to make sure that they have checked them
half and half.

(e) The order in which the statements appear must, of course, be
entirely random. If the examiner never prints more than two true or
two false statements consecutively, examinees may again discover
this and take advantage of it. One way of arranging is to toss a coin.
If it shows heads, take the easiest of the true items as No. 1. If the
next four tosses are tails, take the four easiest false statements as
Nos. 2 to 5; and so on.

(f) Various methods for recording the responses have been used:
underlining the words True or False, or Yes or No, drawing a circle
round the letters T or F, or writing the letters T or F. One of the
simplest and most satisfactory is to write a + or a − in a blank
space at the end of each statement. These spaces should be
aligned in a straight column in the right- or left-hand margin.

(g) The correction for guessing, described on p. 221, should be
applied, and examinees should be warned of this in the general
instructions at the beginning of the test.
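
The coin-tossing procedure for ordering true and false statements can be sketched in code. This is only an illustration under stated assumptions: both lists are taken to be already sorted from easiest to hardest, Python's `random` module stands in for the coin, and the function name `arrange_by_coin_toss` is invented here. The text does not say what to do when one pile runs out, so the sketch simply appends what is left of the other pile.

```python
import random

def arrange_by_coin_toss(true_items, false_items, rng=None):
    """Interleave true and false statements by repeated coin tosses.

    Both lists are assumed to be sorted from easiest to hardest.
    Heads takes the easiest remaining true item, tails the easiest
    remaining false item.
    """
    rng = rng or random.Random()
    true_left = list(true_items)
    false_left = list(false_items)
    order = []
    while true_left and false_left:
        heads = rng.random() < 0.5  # one toss of a fair coin
        pile = true_left if heads else false_left
        order.append(pile.pop(0))
    # Assumption: once one pile is exhausted, the rest of the other follows.
    order.extend(true_left or false_left)
    return order

if __name__ == "__main__":
    trues = ["T1", "T2", "T3", "T4"]
    falses = ["F1", "F2", "F3", "F4", "F5"]
    print(arrange_by_coin_toss(trues, falses, random.Random(0)))
```

Because the leftover pile is appended unbroken, the tail of the test may show a long run of one kind; an examiner using such a procedure might prefer to re-toss from the start if that run looks conspicuous.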


MULTIPLE-CHOICE TYPE, INCLUDING BEST REASON AND 
MATCHING ITEMS 
A large number of variants of this type may be distinguished, 
examples of most of which are quoted below. 
(i) The number of alternative responses is usually five, but it may
range from seven to three, or even two, according to the number of
possibilities implicit in the question. Six- or seven-response questions 
occupy rather a lot of space. Three- or two-response questions should 
usually be avoided because they require the application of guessing 
corrections. In any one new-type test it is convenient, though not 
really necessary, to provide the same numbers of responses for all 
the questions. 

(ii) When the responses consist of single words, they can be 
printed all on one line, consecutively. They should be enclosed in 
brackets, and preferably be printed in italics. The examinee is then 
instructed to underline the one word in each set of brackets which is 
correct. For example: 

22. They (was, were) going (to, too) London (too, to) visit (their, 

there) aunt, (who, what) lived (there, their). 
When, as in most of the remaining examples, the responses consist 
of several words or phrases, underlining is troublesome both for the 
examinee and for the marker. The responses may sometimes be 
printed consecutively and numbered, and the examinee writes the 
number of his choice in a blank space. For example: 

23. What is the function of the sacral autonomic nervous system?
(1) To increase circulation and respiration. (2) To control posture
and movements of the limbs. (3) To increase the activity of the
reproductive organs. (4) To contract the sphincters of the excretory
organs. (5) To inhibit the activity of the alimentary canal.


This type of question is, however, inconvenient to read, hence it 
is more usual, as in the examples below, to print the responses in a
column. The examinee indicates his choice by a cross in one of the 
blank spaces in the margin. Unfortunately this method tends to 
consume a lot of space. 

(iii) The items may be stated in question or in completion form. 
Thus Question 23 might equally well commence: 
23. The function of the sacral autonomic nervous system is:

(iv) In the ‘best reason’ type, the examinee is instructed to check
the best of a set of reasons for a given statement. Occasionally a
more suitable item may consist in finding the worst reason. For
example:

24. Many investigations indicate that children’s intelligence is
somewhat affected by environment and education. Which of the
following provides the poorest evidence for this conclusion?
    The correlation between the intelligence of orphans
      and their parents is lower than that between
      children and their parents who have reared them      ........
    The average difference between the I.Q.s of pairs of
      identical twins is higher among those reared apart
      than among those reared together                      ........
    Children tested before placement in foster homes,
      and again after several years, show a greater
      increase in intelligence in the better than in the
      poorer homes                                          ........
    Children of professional parents are on the average
      markedly superior in intelligence to children of
      unskilled labourers                                   ........


(v) Two or three responses, instead of one, may be correct.
Occasionally none of them may be correct. The instructions must
make it absolutely clear to the examinees what is wanted. For
example:


25. Put a cross opposite two of the following English painters
who are especially famous for their landscapes:

    Constable  ...x...         Reynolds  ........
    Hogarth    ........        Romney    ........
    Lawrence   ........        Turner    ........

26. Put a cross opposite one answer only.
Sir Thomas Browne is best known as the author of
    Fiction           ........
    Plays             ........
    Poetry            ........
    Works of travel   ........
    None of these     ........

(vi) The same set of responses may apply to several questions. For
example:

Identify the following orchestral instruments by drawing a circle
round S for strings, W for woodwind, B for brass, and P for
percussion:

    27. Clarinet       S  W  B  P
    28. Triangle       S  W  B  P
    29. Double bass    S  W  B  P
    30. English horn   S  W  B  P
    31. French horn    S  W  B  P



(vii) Matching tests are an extension of the previous category. A
number of questions and a number of responses are listed in different
order, and have to be fitted together. For example:


32. On the left is a list of Treaties or Peaces, on the right a list
of historical developments which were associated with these Treaties.
After each of these developments write the letter of the Treaty to
which it corresponds. Note that some of these letters need not be
used at all, and that some may be used twice.

Peace or Treaty of:     Historical Development:

(a) Berlin              Napoleonic empire broken up            ........
(b) Frankfort           League of Nations established          ........
(c) Paris               German annexation of Alsace-
(d) Tilsit                Lorraine                             ........
(e) Utrecht             France regains Alsace-Lorraine         ........
(f) Versailles          Independence of Holland finally
(g) Vienna                recognized                           ........
(h) Westphalia          Britain becomes supreme colonial
                          power                                ........
                        Western powers set the stage for
                          Balkan wars                          ........


Many matching tests have equal numbers of items in both columns, 
each item corresponding to one, and only one, in the other column. 
Since, however, this makes possible the discovery of responses by a 
process of elimination, the above plan, with unequal numbers, is 
superior. The numbers of items should usually be five to twelve in 
each column; a greater number involves waste of time in searching. 
When possible the items in at least one column should be in alpha- 
betical order, to facilitate the search. It may be noted that this type 
is more compact than ordinary multiple choice, and so saves con- 
siderable space. 

The multiple-choice type of question is the most generally ap- 
plicable in educational testing—to diagrams, maps, etc., as well as to 
verbal or numerical problems. Not only factual information, but also 
quite elaborate reasoning and interpretation can be measured by it. 
In the best-reason type, for example, the alternatives provided may 
be graded in reasonableness; but only one answer should be com- 
pletely correct. Again in matching tests, the examinees can be re- 
quired to apply a series of principles to a series of problems. Very 
great care, however, is needed in the construction, and the following 
hints may be useful. 

(a) The multiple-choice type should not be used when simple-recall
questions would be adequate, i.e. when there is only one
definitely right answer to the question. For example, it would be a
waste of time to express arithmetical problems this way:


33. о oS 1, 14, 2. | 
Indeed, most mathematical problems are better in open than in
closed form, since the latter may enable the examinee to work back-
wards from each answer until he reaches the right question, which is
a very different matter from working forwards from the question.

(b) All the responses provided must be sufficiently plausible to be
selected by a fair proportion of the examinees. Otherwise there is no
object in including them. The incorrect alternatives can often be
made up out of misconceptions and errors which are likely to
be prevalent among the examinees. Sometimes, also, they will be
selected by many of the poorer examinees if they contain several
words or phrases identical with words or phrases in the question or
introductory statement. Both correct and incorrect responses should,
however, be homogeneous in their mode of expression, length, and
other external characteristics. If the correct answer is always longer,
or always shorter, than the incorrect ones, examinees may discover
this and employ it to their advantage.

(c) Each response should be studied for possible irrelevant clues. 
For example, each one must be grammatically consistent with the 
question. This is especially necessary in matching tests. Some testers 
throw together such heterogeneous items that the examinees may be 
able to match them without any real knowledge of the subject. The 
following is an extreme (hypothetical) instance of a bad item which,
although supposed to show knowledge of geographical terms, could
be answered correctly by inference alone.


34. (a) A geographical line along
        which atmospheric pressures
        are equal                         Lochs          ........
    (b) Scottish lakes or sea inlets
    (c) Winds which blow at a certain     Monsoons       ........
        season in the Indian
        Ocean                             Watershed      ........
    (d) A high piece of land from
        which rivers flow in two          Isobar         ........
        or more directions
    (e) Lines on a map showing            Contour lines  ........
        heights above sea-level


‘Contour lines’ must be (e), because ‘lines’ are mentioned in (e).
‘Monsoons’ and ‘lochs’ must be either (b) or (c), since these are the
only remaining plurals. It is unlikely that ‘monsoons’ would be
printed so nearly opposite its corresponding item, moreover ‘lochs’
sounds like ‘lakes,’ hence ‘lochs’ must be (b). (a) and (d) mention
‘geographical line’ and ‘rivers,’ hence ‘isobar’ and ‘watershed’ will
probably fit them.

The examiner should remember that a matching test is an ex- 
tended form of multiple-choice item, and should therefore see that 
for each item in one column there are at least three items, preferably 
four or more items, in the other column which constitute possible 
answers. 

(d) Not infrequently a better item may be obtained by turning 
round a partly formulated question into an answer, and making the 
correct answer into a question. For example: 


35. Emphasis by means of an understatement is called:
    Meiosis       ........
    Hyperbole     ........
    Euphemism     ........
    Metaphor      ........


This might be re-formulated as follows:


36. Meiosis denotes:
    Emphasis by means of an understatement                   ........
    An exaggerated rhetorical expression                     ........
    Substitution of a pleasant for an offensive expression   ........
    An imaginative simile                                    ........


Probably the second form (No. 36) requires a fuller understanding 
of the subject than does the first (No. 35). A third, perhaps still 
better form of the question, which would demand the application and 
not merely the possession of knowledge, would be a set of phrases 
from among which examinees would select the one showing meiosis. 
In general the examiner should study each item in an attempt to 
anticipate all the mental processes by means of which the correct 
answer may be reached. If any of these processes are not representa- 
tive of the ability that he wishes to measure, then he should modify 
the item. 

(e) All statements and answers should be as succinct as possible,
consistent with accuracy. Very verbose or complicated questions 
may depend more on the candidate's reading ability and intelligence
than on his knowledge of the subject. (Possibly Qu. 24 offends in
this respect.)

(f) It should hardly be necessary to point out that the position
of the correct answer in the series must be chosen entirely at random.
First and last places should be employed as often as any of the
intermediate places.
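
One way of leaving the position of the correct answer to chance, as the last hint requires, is to shuffle the whole set of responses and record where the correct one lands. The sketch below is an illustration only: the function name `shuffle_responses` is invented here, the responses are assumed to be distinct, and Python's `random` module is used as the source of chance.

```python
import random

def shuffle_responses(correct, distractors, rng=None):
    """Return the responses in random order and the place of the correct one.

    Assumes all responses are distinct strings.
    """
    rng = rng or random.Random()
    responses = [correct] + list(distractors)
    rng.shuffle(responses)  # every ordering is equally likely
    return responses, responses.index(correct)

if __name__ == "__main__":
    options, place = shuffle_responses(
        "Meiosis", ["Hyperbole", "Euphemism", "Metaphor"], random.Random(7)
    )
    print(options, "correct answer in place", place + 1)
```

Over a long test the examiner could tally the recorded places to confirm that first and last positions turn up about as often as the intermediate ones.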


REARRANGEMENT TYPE
A series of items which normally fall into a certain specific order
is listed in disarranged order, and the examinee has to rearrange
them. For example:


37. Below is a list of waves or rays. Opposite No. 1 in the right-
hand margin, write the letter (a, or b, or c, etc.) of the wave or ray
with the longest wave-length. Opposite No. 2 write the letter of the
wave or ray with the second longest wave-length, and so on down
to No. 7.¹


(a) Rays from a sodium vapour lamp                   1. ........
(b) Waves from a black kettle containing hot
    water                                            2. ........
(c) Gamma rays                                       3. ........
(d) Television transmission waves                    4. ........
(e) Rays from the green of the spectrum              5. ........
(f) Wireless telephone waves                         6. ........
(g) X-rays                                           7. ........


The mental processes involved in answering this question
naturally depend on what has been taught. If the pupils have been
provided with such a list, they may answer by rote recall; otherwise
they may have to perform an elaborate process of analysis and re-
synthesis of their knowledge of various branches of physics.
By disarranging the order of phrases in sentences, Ballard (1923)
provides a test which, he claims, measures knowledge of sentence
structure and of the canons of English style. The time order of
historical events, the sizes of geographical areas, the order of opera-
tions in some chemical or physical experiment, grammatical se-
quences in foreign languages, and the like, can readily be employed
in this type. It has not, however, been very widely used, probably
because of the difficulty of devising a simple and just method of
scoring (cf. Sims, 1937). Ballard's (1923) plan seems fairly satisfactory:
Award one mark if the first item is correctly identified; thereafter
award one mark for each item which correctly follows the one
preceding it. For instance, if an examinee answers Question 37:


(f) (d) (a) (e) (g) (b) (c) instead of (f) (d) (b) (a) (e) (g) (c),


(f) is correctly placed first and scores one; (d) correctly follows (f)
and scores one; (a) should not follow (d), but (e) and (g) are in
correct sequence, and so score two; (c) does not score, although it is
correctly placed last, because it should not follow (b). The total
score here is thus 4 marks.

¹ The reader may wonder why the instructions are not simplified by requiring
examinees to write 1 after the item that should come first, 2 after the next, and
so on. The reason is that the scoring of such a response would be exceedingly
time-consuming, unless the whole of it was correct.
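
Ballard's scoring rule, as described above, is mechanical enough to be written out as a short routine. The following is a minimal sketch, assuming each item appears exactly once in both the answer and the key; the function name `ballard_score` is invented here for illustration.

```python
def ballard_score(answer, key):
    """Score a rearrangement answer against the correct order.

    One mark if the first item is right, then one mark for every item
    that correctly follows the item placed immediately before it.
    """
    # Map each item to the item that should come directly after it.
    successor = {a: b for a, b in zip(key, key[1:])}
    score = 0
    if answer and answer[0] == key[0]:
        score += 1  # first item correctly identified
    for prev, cur in zip(answer, answer[1:]):
        if successor.get(prev) == cur:
            score += 1  # this item correctly follows the one before it
    return score

if __name__ == "__main__":
    key = list("fdbaegc")                        # correct order for Question 37
    print(ballard_score(list("fdaegbc"), key))   # 4, as in the worked example
    print(ballard_score(key, key))               # 7, the maximum
```

Note that position as such earns nothing after the first place: (c) is last in both sequences but scores no mark because the item before it is wrong.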

Certain other minor types of question have been described by 
Ruch (1929), Rinsland (1938), and others. But these all appear to be 
variants of the simple-recall or completion, true-false, or multiple- 
choice types. 


PRACTICAL EXAMINATIONS 


Probably one of the most unreliable of conventional examinations
is the practical in physics, chemistry, and biology and the engineering
trade test; because a 3-, or even a 6-, hour period can provide only
such a limited sampling of the skill or knowledge of the examinees.
Though such examinations cannot be converted to wholly objective,
multiple-choice form, it was found by American institutions for
technical training of recruits during the war that great improvements
were possible along the same lines. The operations at which examinees
are supposed to be competent are analysed into a series of steps,
and some 30 to 50 typical problems are selected, each of which can
be done in a few minutes. For example, in electricity, a certain circuit
is nearly completed, and the examinee has to make one essential
connection which will show his grasp of the circuit without his hav-
ing to do the whole wiring. In chemical analysis, he may be told that
such and such reagents have given such and such results; what should
he do next? Several questions in an engineering examination may be
based on capacity to choose the right tools for a given job. In others,
parts of machines may be laid out and the examinee required to
indicate how they should be repaired, and so on. Each of the fifty-
odd questions needs to be supervised by a separate examiner;
senior students can often be employed for this purpose. The ex-
aminees move round from one problem to another in turn, as many
being examined simultaneously as there are problems. Each problem
is usually marked as right or wrong, the right response being care-
fully defined beforehand; though occasionally half-marks may be
given if partial competence can be adequately defined.


FINAL STAGES IN THE CONSTRUCTION OF A NEW-TYPE
TEST


The examiner should rescrutinize every item, try to eliminate all
ambiguities, and ensure that the whole of each item is going to
function, i.e. that every word or phrase will be essential in arriving
at a correct response, and that there is no possibility of obtaining
correct responses through irrelevant clues (cf. Hawkes and Lindquist,
et al., 1937). This can be done much more effectively if the items
are submitted to other experienced examiners who may detect
flaws which the author has not noticed.

Next it is essential to make up comprehensive instructions for each
type of question, which will be appropriate to the mental level of the
dullest examinees. Unless the examinees are fairly experienced in
new-type testing, sample questions already answered should be in-
cluded. In classroom work it is a good plan to provide three samples,
one already answered, to read this through with the class pointing
out why the given answer is correct, and then to ask the class to
suggest answers to the other samples, helping any individual mem-
bers who still do not understand what is wanted.

Attention should then be directed to ease of scoring. If all the
answers can be aligned in a column in the right-hand margin, a key
can readily be prepared consisting of a sheet for each page of the
test, down the left-hand edge of which are written the correct answers
in the correct positions.

It might be supposed that as some questions are harder than
others, they should be allotted more marks. But actually
experiments show that weighting is hardly worth the trouble;
unweighted total scores correlate almost perfectly with weighted
totals. A very simple form of weighting such as the following may be
used: award 1 mark to each correct simple-recall or completion
response, also to each correct response in a matching item; give 1
mark to each true-false or two-response multiple-choice item and
apply the R-W correction; give 1 mark to each three-response
multiple-choice answer, but neglect the correction for guessing; give
2 marks to each multiple-choice item where four or more responses
are provided.
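
The simple weighting scheme just described can be set out as a small scoring routine. This is a sketch under stated assumptions rather than a prescribed method: the R-W correction is taken in its usual sense of rights minus wrongs (an omitted item counting nothing), the mark value for true-false items is read as 1, and the names `item_mark`, `weighted_total`, and the item-type labels are invented here.

```python
def item_mark(item_type, outcome, n_responses=None):
    """Mark for one item; outcome is "right", "wrong", or "omitted"."""
    if item_type in ("recall", "completion", "matching"):
        return 1 if outcome == "right" else 0
    two_way = item_type == "true_false" or (
        item_type == "choice" and n_responses == 2
    )
    if two_way:
        # R-W correction: rights minus wrongs; omissions count nothing.
        return {"right": 1, "wrong": -1}.get(outcome, 0)
    if item_type == "choice" and n_responses == 3:
        return 1 if outcome == "right" else 0  # no guessing correction
    if item_type == "choice" and n_responses is not None and n_responses >= 4:
        return 2 if outcome == "right" else 0
    raise ValueError(f"unrecognised item: {item_type!r}, {n_responses!r}")

def weighted_total(results):
    """results: iterable of (item_type, outcome, n_responses) tuples."""
    return sum(item_mark(t, o, n) for t, o, n in results)

if __name__ == "__main__":
    paper = [
        ("recall", "right", None),
        ("true_false", "right", None),
        ("true_false", "wrong", None),
        ("true_false", "omitted", None),
        ("choice", "right", 3),
        ("choice", "wrong", 3),
        ("choice", "right", 5),
        ("matching", "right", None),
    ]
    print(weighted_total(paper))  # 1 + 1 - 1 + 0 + 1 + 0 + 2 + 1 = 5
```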



Although scoring is so rapid and fool-proof, slips do occur,
particularly in adding up totals. Either then the marker should go
through the papers twice, or else let the students or pupils re-mark
their own papers and ask them to bring any errors to his notice.

Finally the examiner will certainly improve his future examining
if he takes the trouble to study the difficulty and the diagnostic value
of each item. A simple method is to sort the papers into four piles:
the quarter with the highest totals, the quarters above and below the
median, and the quarter with the lowest totals. The numbers of times
each item is correctly answered by members of each group, and by
the four groups combined, may be tabulated. The examiner may
then compare the actual difficulties of the items with his estimates of
difficulty made when setting the paper, and may note whether the
proportions of passes for each item ascend regularly from the lowest
to the highest group. If they fail to do so he should attempt to
discern the flaw in the item which may be responsible for its poor
validity.
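
The four-pile tabulation lends itself to a short routine. The sketch below is illustrative only: each paper is assumed to be a list of 1s (correct) and 0s (incorrect), one per item; the piles are cut at the quarter points of the ranked totals; and ‘ascend regularly’ is interpreted, as an assumption, to mean that the proportion correct never falls from one pile to the next higher. The name `item_analysis` is invented here.

```python
def item_analysis(papers):
    """Tabulate, for each item, the passes in four piles ranked by total.

    papers: list of equal-length lists of 0/1 marks, one entry per item.
    Piles run from the lowest quarter to the highest quarter.
    """
    ranked = sorted(papers, key=sum)  # lowest totals first
    n = len(ranked)
    cuts = [0, n // 4, n // 2, (3 * n) // 4, n]
    piles = [ranked[cuts[i]:cuts[i + 1]] for i in range(4)]
    table = []
    for item in range(len(papers[0])):
        counts = [sum(p[item] for p in pile) for pile in piles]
        props = [c / len(pile) if pile else 0.0
                 for c, pile in zip(counts, piles)]
        table.append({
            "item": item + 1,
            "counts": counts,          # lowest pile ... highest pile
            "total": sum(counts),      # all four piles combined
            # True when the proportion passing never drops between piles.
            "ascending": all(a <= b for a, b in zip(props, props[1:])),
        })
    return table

if __name__ == "__main__":
    papers = [
        [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0],
        [1, 1, 1, 1], [0, 0, 0, 1], [1, 0, 0, 1], [1, 1, 1, 0],
    ]
    for row in item_analysis(papers):
        print(row)
```

In this made-up data the fourth item is passed by the weakest pile but failed by the pile above the median, which is the kind of irregularity that should prompt a second look at the wording of the item.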




CHAPTER XIV 
IMPROVEMENTS IN EXAMINING 


EXAMINATIONS have far too great a burden to bear. They are sup-
posed to indicate, as shown in Chap. XI, a large number of only
partially related traits and abilities. If we could define more clearly
what it is that we want to measure, and use for each purpose the
most appropriate instrument, there is no doubt that we could effect
a great improvement in the efficiency of these instruments. Let us
consider first some of the substitutes for examinations which have
been recommended by educationists, and the functions which they
may serve.

Oral Examinations, Viva Voces, and Interviews.—Oral examina-
tions are, of course, much older methods of educational measure-
ment than written examinations. Their main defects were pointed
out when written examinations began to take their place, over one
hundred years ago. It was recognized that the situation to which
successive candidates are subjected cannot be the same for all, so
that no sure basis exists for a comparison between different candi-
dates. The examiner’s tone of voice, the wording of his questions,
his encouragement or criticism, etc., may vary widely from one
candidate to another. If, as in school inspections, all the candidates
are interviewed in the presence of one another, the examiner has to
pose a series of different questions which are unlikely to be closely
comparable in difficulty. In any viva voce the amount of time de-
voted to each candidate is usually so short that reliable indications
of his abilities can hardly be expected. There is the further defect that
the situation may cause much greater emotional disturbance than
does a written examination. The interviewer may wish to take certain
qualities of temperament and character into account in his assess-
ments, but the emotional qualities which are most prominent during
the interview are probably not very typical of the candidate’s usual
behaviour, nor of any great significance for educational purposes.
When care is taken to put the candidate at ease, skilful questioning
may, indeed, reveal something of his interests, his attitude to work,
and cultural background, which are educationally relevant. But as
often applied in 11+ selection, the secondary school Head’s inter-
view probably does little more than assess social class and nice
manners. Oral examinations in foreign languages will bring out the
candidate’s goodness of pronunciation, but cannot be trusted to
show his facility in conversation, because of the distorting effects of
emotion. For the same reason, viva voces in other subjects cannot
throw much light on the candidate's capacity for formulating his
knowledge orally. Theoretically they should constitute a valuable
supplement to written examinations, since they provide a fresh
medium of expression for ability. But investigations have shown that
judgments based on interviews tend to be still less reliable than those
based on essay-examination answers.

Hartog and Rhodes's (1935) study of the Civil Service interview
is particularly apposite. Sixteen candidates were interviewed in the
ordinary way by two different boards, each composed of four or five
experienced examiners. The members of each board consulted
beforehand as to the questions they would ask (but the two boards
did not inter-communicate), and had before them the candidates’
educational records. During the quarter- to half-hour interviews the
members attempted to assess a candidate's alertness, general in-
telligence, and the suitability of his personality for the Civil Service.
After discussion they drew up a final mark. It was found that mem-
bers of any one board did agree fairly closely in their judgments, but
that the two boards had reached widely differing conclusions. The
two final marks differed by 12% for the average candidate, in one
case by as much as 31%, in another by as little as 1%. The correla-
tion between the two sets of marks was only +0.41 ± 0.14. Except
for the small number of the candidates, there seems to be no
technical flaw in this experiment which would allow us to challenge
its depressing results.
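
The margin of 0.14 attached to this correlation agrees with the classical ‘probable error’ of a correlation coefficient, 0.6745(1 − r²)/√n, for r = 0.41 and n = 16 candidates. That this is the statistic the authors intended is an assumption made here, not something stated in the passage; the short check below merely shows that the arithmetic is consistent.

```python
import math

def probable_error_of_r(r, n):
    """Classical probable error of a correlation r based on n cases."""
    return 0.6745 * (1 - r * r) / math.sqrt(n)

if __name__ == "__main__":
    pe = probable_error_of_r(0.41, 16)
    print(round(pe, 2))  # 0.14
```

On this reading the observed correlation is only about three probable errors from zero, which is one way of seeing why the small number of candidates is singled out as the experiment's one weakness.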

There have been many other studies of the interview in the
educational, vocational, and military fields, showing similar in-
consistencies among interviewers (cf. Vernon and Parry, 1949). In
view of the frequent use of the interview by university authorities for
selecting students, Himmelweit and Summerfield's (1951) study is
very relevant. They obtained almost zero correlations between pre-
dictions made by university staff from interviews and subsequent
degree performance, whereas an academic entrance examination gave
small positive correlations and a battery of psychological tests
distinctly better ones (cf. also Vernon, 1953). It would seem that
oral questioning and the interview, though essential in education for
purposes of teaching, are useless for purposes of assessing ability or
the results of teaching. In the course of a school year a teacher can
undoubtedly build up a fairly comprehensive and reliable picture of
each of his pupils’ abilities, which may be largely derived from oral
work. But the conditions here are very different from those in a
brief viva voce. The volume of questioning is far larger, the emotional
atmosphere less strained, and the oral answers are supplemented by
written work.

Perhaps the best instance of the value of the interview is provided 
by the work of vocational psychologists. In advising a child on his 
career the psychologist applies several objective tests of aptitudes, 
and takes both medical and educational records into account. But 
his information as to the child’s personality and interests is derived 
chiefly from an interview. This is conducted informally so as to put 
the child at ease, and though it follows no stereotyped plan of ques- 
tioning, it has, in fact, been designed very carefully so as to elicit 
the child’s main traits. A further function of this interview is that it 
enables the psychologist to interpret the test results and other 
objective records, and to realize their significance in a synthetic view 
of the child as a whole. Guidance, which is based merely on the 
mechanical application of tests or examinations, is not successful. 
But there is ample proof that the guidance given by trained psycho-
logists who adopt these test and interview methods does possess a
high degree of predictive validity. Again in vocational selection for
skilled trades, the psychologist can often use the interview success-
fully for assessing the amount and the relevance of previous trade
experience, though, when conditions permit, he also constructs
and validates a battery of selection tests (cf. ftn. p. 145). For higher-
grade occupations such as managers, officers in the Services, and
teachers, where tests seldom seem to be of any value, the technique
of observing small groups of candidates (as developed by War
Office and Civil Service Selection Boards), combined with a psycho-
logist’s interview, is the most successful (Vernon, 1953).

It would seem, then, that the interview, as at present employed in
educational circles, is practically valueless, but that in skilled hands
it may become a most useful tool. There is no reason other than
expense and the scarcity of competent psychologists why the methods
of vocational selection and guidance should not be applied in educa-
tion, both for choosing pupils most likely to benefit from higher
education, and for advising them as to the lines of study for which
they are best fitted. This is already common practice in many
American and some Australian Universities. It is occasionally used
here at the 11+ selection stage, though necessarily confined to a
small proportion of borderline candidates.

Intelligence and Special Aptitude Tests.—Theoretically, intelligence 
tests should be able to fulfil one of the functions commonly assigned 
to examinations much better than examinations do at present, that 
is the selection of pupils or students with the greatest intellectual 
promise, as distinct from achievement. As we have seen, this is not 
because they really measure innate ability, but rather because they 
show the general level of thinking that the child has developed out- 
side school, which he should be able to apply in school. They should 
also be able to measure objectively the qualities which essay-type 
examinations are supposed to elicit, and which new-type tests are 
supposed to be incapable of eliciting, such as ‘originality, power of 
thinking, grasp of general principles, etc.’ Experimental investiga- 
tions indicate that they do have some value for such purposes, but 
that they need to be employed with caution. They seem to be most 
useful among younger children. The Terman-Merrill, applied at 6 or 
7 years, correlates very highly with educational achievement for the 
next few years; coefficients of +0.80 or more may be expected. At
the 11+ selection stage, the group intelligence test usually gives
slightly better correlations with subsequent achievement than any
other single instrument. Within a selected grammar school popula-
tion, correlations would be expected to drop to about +0.40, hence
there will naturally be many discrepancies between I.Q. and bright- 
ness as observed by the school teachers. But when these teachers 
criticize tests they fail to realize how very much worse most of the 
pupils with I.Q.s of 110 down to 70 would be. In the modern school, 
correlations remain relatively high. 

At later ages, the agreement between general intelligence tests 
and academic work is usually lower still, because of the still greater 
restriction of range. In American universities, with their more
heterogeneous students, coefficients still average around 0.4 to 0.5
(cf. Eysenck, 1947b). But here the figure is usually nearer 0.2 to 0.3
with degree results. Similar correlations are found among Training
College students with their combined theory marks; with practical 
teaching marks there is little or no agreement. 
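
The effect of restriction of range on a correlation can be illustrated numerically. The sketch below uses the usual textbook formula for a group selected directly on the test itself; that formula, the function name, and the SD ratios tried are assumptions brought in for illustration and are not taken from the text.

```python
import math


def restricted_correlation(r_full: float, sd_ratio: float) -> float:
    """Correlation to be expected in a group selected on the test.

    r_full   -- correlation in the unselected population
    sd_ratio -- SD of test scores in the selected group divided by
                the SD in the unselected population (0 < sd_ratio <= 1)

    Assumes direct selection on the test and a linear relationship
    with equal scatter about the regression line at all score levels.
    """
    numerator = r_full * sd_ratio
    denominator = math.sqrt(1.0 - r_full ** 2 + (r_full * sd_ratio) ** 2)
    return numerator / denominator


# Illustrative SD ratios only; none of these figures comes from the text.
for ratio in (1.0, 0.5, 0.33):
    print(ratio, round(restricted_correlation(0.80, ratio), 2))
```

Under these assumptions, a correlation of 0.80 in the whole age-group falls to roughly 0.40 when the selected pupils show only about one-third of the original spread of scores.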

More useful predictions at late adolescent and adult levels are 
likely to be reached through testing special aptitudes or group fac-
tors. Prognostic tests are intermediate between intelligence and
attainment tests. They try to show, not how much mathematics or
English the pupils have previously acquired, but how well they can
apply their intelligence plus acquirements to fresh mathematical or
linguistic problems. Earle (1949) has had some success with English,
science, and mathematics tests, and with his Duplex tests, in diag-
nosing success at different types of secondary courses. Many Ameri-
can colleges give ‘placement’ examinations to incoming freshmen,
consisting of linguistic and numerical + non-verbal intelligence tests,
together with English, mathematics, science, foreign language and
social studies tests, in order to determine the students’ main aptitudes.
There is room for considerable development of such work in our own
schools and universities (cf. Dale, 1954).

Cumulative Record Cards.—The primary function of many exam-
inations is to provide a certificate of work done. Ostensibly this is
true of the General and Scottish Leaving Certificates, and of univer-
sity degree examinations, though they are also often interpreted
prognostically. Now, in most schools the teachers know quite well
what work each pupil has done. Hence there is every reason for
making more use of teachers’ records of this work and less use of
external examinations. The former cover the whole ground, the latter
never more than a fraction of it. The former are collected under
everyday conditions, the latter when many pupils are in an abnormal
emotional state due to cramming and fear of failure. The former are
also less likely to produce the degrading effects upon teachers’ and
pupils’ educational values which are characteristic of public com-
petitive examinations. Teachers’ markings of school work are, of
course, just as liable as examiners’ marks to subjectivity and un-
reliability; perhaps more so, because of the prejudices they are likely
to possess regarding each pupil’s character and ability, which may
influence their judgment of his written work. But if in the course of
his career a pupil is taught by several different persons, each of whom 
records his achievements according to some uniform system, the 
cumulative record should yield a fairly reliable picture. Although, 
as stated above, the function of such records should be to certify 
work done rather than to predict future promise, yet the investiga- 
tion by the Scottish Council for Research in Education (1936) into 
the Leaving Certificate showed that teachers’ marks did in fact cor- 
relate with subsequent university success just about as well as did the 
external examiners’ marks. Likewise in McClelland’s (1942) and
several subsequent studies of secondary school selection, scaled 
primary school teachers’ estimates have given better predictions of 
grammar school work than have objective attainments tests. The 
many other uses to which cumulative records can be put, such as 
vocational and educational guidance, have been fully described by 
Hamley et al. (1937) and Walker (1955).

The objection which will at once be raised to such schemes is—how 
are the standards of different schools to be compared? McClelland 
(1942) demonstrated the enormous discrepancies between teachers’ 
assessments of their pupils and the levels actually obtained by the 
same pupils on objective tests, some teachers being far more generous 
than others. Indeed unless some system of scaling or standardization 
of assessments is adopted, the inclusion of such estimates in a selec- 
tion procedure is known to lower its validity. Several possible 
solutions to this difficulty have been tried. 

Valentine (1938) proposed that at 11 + one or more general in- 
telligence tests should be given in all schools under one authority. 
If, say, one thousand grammar school places were available, and an 
I.Q. of 115 was found to cut off the top thousand pupils, then the
number of pupils in each school scoring 115 and over constituted
that school's ‘quota.’ Meanwhile the head and staff of the school
would have ranked their own pupils in order of suitability for 
grammar school education. This rank order and the I.Q.s would 
finally be combined with appropriate weights. Each school would 
send its allotted quota to the grammar school, but the precise pupils 
included in the quota would be those whom the primary teachers 
had chosen as most able, not merely those with the highest I.Q.s. 
While this quota scheme has, in fact, been used in certain urban areas,
it is quite unsuitable for areas where there are numerous small 
schools, each submitting a dozen or fewer candidates. Even with big 
schools it is rather unreliable since, as shown above (p. 209), a percen- 
tage or proportion has a very large Standard Error. 
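
The quota scheme just described can be sketched in a few lines. This is a simplified illustration only: it fills the quota purely from the teachers' rank order and leaves out the weighted combination of rank and I.Q. mentioned above. The pupil data, the function names, and the binomial formula used for the Standard Error of a proportion are all assumptions made for the example.

```python
import math


def school_quota(iqs, cutoff):
    """Number of pupils in one school scoring at or above the I.Q. cut-off."""
    return sum(1 for iq in iqs if iq >= cutoff)


def select_for_grammar_school(pupils, cutoff):
    """pupils: list of (name, iq, teacher_rank), rank 1 = most suitable.

    The size of the quota comes from the test; the pupils who fill it
    come from the teachers' ranking (a simplification of the scheme).
    """
    quota = school_quota([iq for _, iq, _ in pupils], cutoff)
    by_rank = sorted(pupils, key=lambda p: p[2])
    return [name for name, _, _ in by_rank[:quota]]


def proportion_standard_error(p, n):
    """Usual binomial Standard Error of a proportion p observed in n pupils."""
    return math.sqrt(p * (1.0 - p) / n)


# Hypothetical school of five candidates.
school = [("Ann", 121, 2), ("Bob", 117, 4), ("Cyd", 113, 1),
          ("Dee", 109, 3), ("Eve", 98, 5)]
chosen = select_for_grammar_school(school, cutoff=115)
print(chosen)
print(proportion_standard_error(0.25, 12))
```

In this made-up school two pupils reach the cut-off, so the quota is two, but the places go to the two pupils ranked highest by the staff, one of whom scored below 115. The second function shows why small schools are troublesome: with a dozen candidates a quota of one quarter carries a Standard Error of about an eighth.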

An alternative plan, which is receiving experimental trials, is to 
fix each school's quota initially at the same number as gained gram- 
mar school places in the previous year (or recent years). A few pupils 
who are ranked by their school as just above or below this borderline 
are examined, and the quota is adjusted up or down according to 
their results. This method is referred to by the Ministry of Educa- 
tion as selection without any external examination—not even an 
intelligence test. We have already pointed out that fairly considerable
variations in standards may occur, by chance, from year to year
(p. 63). Thus it is possible that the method, though admirable in
intention, may break down through the necessary quota adjustments
becoming too numerous and unwieldy.

Many American Universities employ what is, essentially, an elabo-
ration of the same scheme. They record the final secondary school
marks of students coming from various contributory schools over
several years, together with the performance of these students at the
university itself. Eventually it becomes possible to determine the
relative standards of the secondary schools concerned. For example,
it may be found that students with 60% from School A are as
good as those with 80% from School B. Thus the numbers of students
worthy of selection from each school can be deduced.
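
One simple way of deriving such relative standards is sketched below. It assumes that each school's leniency can be expressed as a single additive offset between its average school mark and the average university mark later earned by the same students; this assumption, the records shown, and the function names are illustrative and do not describe any particular university's procedure.

```python
from statistics import mean


def school_offsets(records):
    """records: {school: [(school_mark, university_mark), ...]}.

    Returns, for each school, the average university mark minus the
    average school mark of its past students (negative = lenient school).
    """
    return {
        school: mean(u for _, u in pairs) - mean(m for m, _ in pairs)
        for school, pairs in records.items()
    }


def adjusted_mark(school, mark, offsets):
    """Express a school mark on the common (university) standard."""
    return mark + offsets[school]


# Hypothetical past records accumulated over several years.
records = {
    "A": [(60, 62), (70, 68), (50, 50)],
    "B": [(80, 60), (90, 72), (70, 48)],
}
offsets = school_offsets(records)
print(adjusted_mark("A", 60, offsets), adjusted_mark("B", 80, offsets))
```

With these invented figures a mark of 60 from School A and a mark of 80 from School B come out at the same adjusted level, which is the kind of equivalence described in the text.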

Probably the best plan is that advocated by McIntosh et al. (1949), the
National Foundation for Educational Research and others, of scaling
each school's own marks or rankings against some uniform external
set of attainments tests, by an adaptation of the techniques described
above (pp. 54-8). Selection can then be based primarily on the teach-
ers’ judgments, derived from internal school examinations and
cumulative records of junior school work. But each set of marks
is calibrated or scaled up or down to accord with the distribution
obtained by that school's candidates on the external tests.
An agreed weighting can be given to the objective tests also. This
scheme too requires fairly large numbers per school, though Mc-
Intosh finds it workable even among schools submitting only a
dozen candidates. It has the disadvantage that schools are still likely
to cram for the external tests, but it does take full account of teachers’
knowledge of their own pupils. A useful discussion of these and other
problems raised by the various selection procedures at 11 + is given
by Dempster (1954).
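
The scaling of one school's marks against an external test might be carried out as in the following sketch. It assumes a plain linear adjustment that gives the teachers' marks the same mean and standard deviation as the school's external test scores; whether this matches the techniques of pp. 54-8 in detail is not certain, and the marks and names used are hypothetical.

```python
from statistics import mean, pstdev


def scale_to_external(teacher_marks, external_scores):
    """Rescale one school's teacher marks to the mean and SD that the
    same candidates obtained on the external test.

    The rank order given by the teachers is preserved; only the level
    and spread of the marks are altered.
    """
    t_mean, t_sd = mean(teacher_marks), pstdev(teacher_marks)
    e_mean, e_sd = mean(external_scores), pstdev(external_scores)
    if t_sd == 0:
        raise ValueError("teacher marks show no spread and cannot be scaled")
    return [e_mean + (m - t_mean) * e_sd / t_sd for m in teacher_marks]


# Hypothetical school: a generous marker whose pupils are in fact average.
teacher_marks = [70, 80, 90]
external_scores = [95, 100, 120]
scaled = scale_to_external(teacher_marks, external_scores)
print([round(s, 1) for s in scaled])
```

Note that only the distribution of external scores is used, not each pupil's own test result, so a pupil who does badly on the day is still placed by his teachers' estimate.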

Methods such as these require to be worked out and supervised by
a statistically trained psychologist. As described here, they might not
appear to be appropriate as substitutes for the General Certificate
or other higher-level examinations. And yet the accrediting system
adopted by many Australian and New Zealand Universities shows
that something similar can be devised, which will largely banish the
strain and unreliability of externally-conducted examinations for
secondary school leavers, and yet provide a reasonable measure of
scholastic achievement and of suitability for university courses.

IMPROVEMENTS IN ORDINARY EXAMINATIONS


Since external examinations will for a long time play a large, though 
we hope diminishing, part in education, and since even in an ideal 
system teachers must apply internal examinations, and mark their 
pupils’ work, we must consider carefully the improvements which 
might be made in the examination as it now is. From the investiga-
tions we have described, and from the discussion in previous chapters,
the following principles may be deduced. 

1. The teacher or examiner should formulate as precisely as pos- 
sible what it is that he wants to measure. This implies first a clear 
conception of the objectives or aims to which, in his view, education 
should be directed. At the primary school level he may be content 
with the view that education consists in implanting certain tradi- 
tional knowledge in children's minds, such as the way to work out 
sums and to parse sentences. This may be one objective in the secon- 
dary school and university also, though surely not one of the most 
important. History, science, foreign languages, and the like do involve 
the acquisition of a certain body of facts—dates, chemical formulae,
meanings of words, etc.; but they are supposed to possess many 
additional values. Taste for literature, interest in foreign peoples, 
scientific habits of mind, ability to apply knowledge to the interpre- 
tation of present-day events, are but a few of the things which we 
hope to inculcate. In making such an analysis, the teacher must be 
careful to take into account the facts which psychologists have 
established in regard to transfer of training, namely that general 
faculties, like observation, memory, and reasoning, will not be deve- 
loped merely by the study of some particular school subject, and 
that indeed these faculties have little real existence. For example, a 
rational and scientific approach to contemporary problems can be 
developed by studying typical problems rationally, by conducting 
discussions and projects, by showing how irrationalities of thinking 
arise, and so on; but not by the carrying out of stock experiments in 
physics, nor by memorizing the rules of grammar or logic. A Be- 
havioristic formulation should be helpful: instead of talking in terms 
of vague mental qualities, we should ask—what can the educated 
pupil or student do in each specific field of study, or in the world at
large, which the uneducated one cannot do; and what are the precise
elements of our teaching which are supposed to confer upon him
these capacities?

Having formulated as candidly as possible our general philosophy
of education, we can ask which of these objectives are to be examined. 
We may decide that desirable moral conduct and social adjustment 
cannot be expressed in written work. But each objective should be 
considered from the point of view—what can the educated pupil 
write, or what problems can he solve, which the uneducated one 
cannot.

The external examiner, who has not been responsible for teaching 
the examinees, should adopt a similar procedure in deciding what his
examination is meant to measure. He needs, however, to be especially
careful to emphasize ‘can’ rather than ‘ought’ in his analysis. It is
merely futile for him to complain that teachers are not doing their
work properly, and to insist that examinees must conform to intellec-
tual standards which may be current among persons like himself, but
which are quite unattainable by children or immature students. His
job is to set tasks which will differentiate the better pupils who are
in real, not ideal, schools, from poorer ones. If he is supposed to pick
out, say, the best 30%, but formulates objectives which even the best
5% can hardly achieve, then he is not doing his work efficiently and
is creating unnecessary difficulties for himself when the time for
marking arrives.

A group of persons, such as a head master and his staff, or a board
of examiners, should be able to carry out such analyses much more
logically and thoroughly than can any one person, however clear-
headed he may be, since they can criticize one another’s ideas.

Finally, the conclusions of this analysis should be communicated
by the examiners to their prospective examinees, or by the teachers
to their pupils. At the present time examining frequently degenerates
into a battle between the examiners who try to catch out their victims,
and examinees who try to find out what their examiners really want,
and what are their personal prejudices, by conning the papers they
have set in the past. All this is a stupid waste of everybody's time.

2. Having decided upon the abilities that he wishes to measure,
the teacher or examiner should consider the form of examination
which will be most appropriate for each of them. French conversa-
tion, qualitative chemical analysis, and other capacities obviously
require special types. But in most instances the important point to be
settled is whether the examination should be new-type, essay-type,
or a mixture. Two pertinent considerations in this connection are the
number of examinees, and the examiner’s own skill in constructing
new-type questions. If the numbers are very small, he may be justified
in shirking the extra expenditure of time required by objective exam- 
ining. And he should know from past experience that he is more 
successful in devising new-type measures of some things than of 
others. 

Generally speaking, any kind of factual information, or straight- 
forward skills, such as the carrying out of arithmetical operations or 
grammatical analyses, should be examined in new-type form, instead 
of being left to the vagaries of subjective marking. This need not 
mean that only trivial and unrelated items of knowledge or skill 
should be included. Good questions, as we have seen, can bring out 
the examinees' ability to interpret the significance of facts, grasp of 
general principles, and application of such principles to specific new 
problems. True, there are many dangers in these more elaborate 
items, but they are hardly as serious as the dangers of trying to 
discover all the underlying abilities of the examinees from their
answers to essay questions. 

We admit, however, that many important aspects of ability may 
be displayed more clearly in essay-type answers or (with subjects 
such as mathematics and physics) in the working out of long and 
complex problems. Literary style of composition, imaginative 
qualities and originality, evidence of the examinee's interest in the 
subject and of his own thinking about it, are some of the things 
which the examiner may wish to elicit, even if he can never hope to 
measure them perfectly reliably. If the examination is designed 
primarily to bring out qualities such as these, then the examinees 
should be told so, and should be informed that amount and accuracy 
of information will not, in this instance, be taken into consideration. 
Frank directions about what is wanted are particularly desirable 
when the same examination includes some new-type and some 
essay-type questions. 

3. If, for one reason or another, essay-type questions have to be 
maids-of-all-work, then the examiner should ask himself which parts 
of the total field to be examined are the most important. Knowing 
that his questions can sample only a fraction of it, he should avoid 
the less essential parts, even at the risk of his questions being easily 
'spotted' beforehand. Overlapping among the questions is obviously 
undesirable also, since it is wasteful to cover any one part twice over. 
The unrepresentativeness of single papers, and the desirability of 
improving reliability by increasing their number should be borne in 
mind. With each question, the length of time needed for a reasonable
answer should be considered, so that the examinees shall not be set 
too much, nor too little, to do in the period allowed. 

Much investigation has been made into the kinds of essay ques- 
tions which are least unreliable, particularly by the New York 
College Entrance Board (cf. Lindquist, 1951). It has been found that 
rather short discussion questions, each of which poses a definite
problem and supplies specific indications of what is wanted, are the
best. Such questions still give plenty of scope for the examinee to 
reveal his powers of marshalling and organizing his material. The 
much more vague, yet very popular, type of question which allows 
greater freedom of response to the examinee only leads to confusion 
and difficulty in comparing one examinee with another.

4. Perhaps the most important quality in a good examiner is his
ability to foresee the sort of answers he will get, both through his
insight into the minds of pupils or students, and through his experi-
ence of their answers in the past. If in formulating each question he
possesses a clear conception of the probable answers—good,
average, and bad—the following advantages are likely to accrue.
The questions will be simply worded, free from ambiguities, so that
they will not be interpreted in such divers ways that the answers are
too heterogeneous to be compared, or marked consistently. The
questions will also be appropriate in difficulty, not, that is to say, the
type of questions which no one but the examiner himself can
answer adequately. Most important of all, the production of answers
to these questions will really bring into play the abilities which the
examination is aiming to measure. When, later on, the examiner
reads and marks the answers, he should continually be trying to
improve this ‘faculty,’ by noting how far his expectations are ful-
filled. Every question may be regarded as an experiment in forecasting
which, although never likely to be completely successful, may be-
come progressively more so with practice.

In this, as in all other branches of examining, two opinions are
likely to be better than one, three than two. A board of four or five 
will perhaps be the most efficient. 

5. When essay-type questions call for answers of a largely factual 
nature, a detailed scheme of marking should be drawn up. This is
especially necessary when several different examiners each mark a
certain proportion of the papers. Experiments indicate that it will
reduce their variability by roughly one-half. But a single examiner
will also greatly improve his consistency if he decides beforehand
just what points to look for, and marks these objectively. 

In assessing the more general qualities which essay answers are 
supposed to bring out, the quality scale method (cf. p. 34) should be 
adopted. The examiner should always be on the alert to ensure that 
he is judging the scripts solely on the basis of the criteria which he 
has previously formulated, that he is not being biased by bad hand- 
writing or spelling, by some trick of style which jars on him, or by 
some personal predilection regarding the subject-matter. All marking 
which cannot be done objectively should consist largely in intro- 
spection: “Why does this strike me as good or bad? Are these the 
qualities which I am supposed to be judging, and are they the qualities 
which I took into account when I selected my standard specimens?” 

Many questions may advantageously be marked both by the 
qualitative and by the objective (count) systems, the figures then 
being combined in due proportion. If two or more examiners mark 
the same, or even a few of the same, scripts, it will be particularly 
valuable to analyse the reasons for any discrepancies in marks, so as 
to clarify still further their criteria of good and bad. 

One other source of prejudice often arises in marking if the 
examiner is personally acquainted with some, or all, of the examinees. 
Either he should carefully avoid looking at the names on the papers, 
or, if he cannot help recognizing their writing or style, should do his 
best to disregard any previous impressions he may have formed 
about them. 

6. With the statistical aspects of marking we have already dealt 
in Chaps. IIV. But the following main points in subjective mark- 
ing (as distinct from count scoring) may be recalled. Marking is first 
and foremost the arranging of scripts in order of merit, or grouping 
them into categories of merit. It is not an attempt to give an absolute 
judgment of the value of each script. All the answers to one question 
should be classified into such categories before starting on another 
question. The categories should be, approximately, evenly spaced, 
and should not exceed ten in number. The frequency distribution of 
marks for each question should tend to the normal shape, and the 
average mark should be very close to 5 (assuming the middle cate- 
gory of merit to have been labelled 5). The dispersion of the distri- 
bution should also be approximately constant for all questions, 
unless it is desired to weight some more heavily than others. The 
final total scores for the scripts may be converted into an appropriate
distribution of percentage marks by any of the three methods given
in Chap. IV. All decisions as to standards of passing and failing should
be postponed until the process of marking has been completed. 
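
The requirement that every question should carry a mean of about 5 and a roughly constant dispersion can be checked, or imposed after the event, along the following lines. This is a sketch under stated assumptions: the text asks the examiner to mark to such a distribution in the first place, whereas the code rescales marks already given; the target standard deviation of 2 and the function names are illustrative choices not found in the text.

```python
from statistics import mean, pstdev


def standardise_question(marks, target_mean=5.0, target_sd=2.0):
    """Rescale the marks for one question to a common mean and SD."""
    m, s = mean(marks), pstdev(marks)
    if s == 0:
        raise ValueError("all scripts received the same mark")
    return [target_mean + (x - m) * target_sd / s for x in marks]


def total_scores(marks_by_question, weights=None):
    """marks_by_question: one list of marks per question, with the
    scripts in the same order in every list.

    Each question contributes equally unless a weight is supplied, in
    which case its standardised marks are multiplied by that weight.
    """
    if weights is None:
        weights = [1.0] * len(marks_by_question)
    scaled = [standardise_question(q) for q in marks_by_question]
    n_scripts = len(scaled[0])
    return [sum(w * q[i] for w, q in zip(weights, scaled))
            for i in range(n_scripts)]


# Two questions marked with very different spreads, three scripts.
q1 = standardise_question([3, 5, 7])
q2 = standardise_question([1, 5, 9])
totals = total_scores([[3, 5, 7], [1, 5, 9]])
print([round(t, 2) for t in totals])
```

Before rescaling, the second question would have counted for twice as much as the first in the total simply because it was marked more widely; afterwards the two carry equal weight.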

The final step in many examinations is to reduce the marks to two,
three, or more categories (Pass or Fail; First, Second, Third Class,
etc.). Here it is most desirable that all scripts which fall within a
few per cent. above or below any important borderline should be
read a second time, preferably by more than one examiner, so as to
ensure that the eventual decision shall attain the utmost reliability
of which fallible human judgment is capable.
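
Picking out the scripts that need a second reading is easily mechanised. In the sketch below the margin of 3 per cent. is an assumed reading of "a few per cent.", and the script labels, marks, and borderlines are invented for the example.

```python
def scripts_to_reread(percent_marks, borderlines, margin=3.0):
    """Return the scripts lying within `margin` marks of any borderline.

    percent_marks -- {script_id: final percentage mark}
    borderlines   -- percentage marks at which a category changes
    """
    return sorted(
        script_id
        for script_id, mark in percent_marks.items()
        if any(abs(mark - line) <= margin for line in borderlines)
    )


# Hypothetical results with a pass mark of 40 and a first class at 70.
marks = {"s1": 38, "s2": 41.5, "s3": 55, "s4": 72, "s5": 69}
flagged = scripts_to_reread(marks, borderlines=[40, 70])
print(flagged)
```

Only the script comfortably in the middle of the second category escapes a second reading here; the other four lie close enough to a borderline for the decision to turn on a mark or two.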


ANSWERS TO THE EXAMINATION ON PP. 213-19


 1. [illegible]                        28. None.
 2. 6.                                 29. A.
 3. 150.                               30. B.
 4. 105 (104 or 106 may also count).   31. 69 or 62.
 5. 80.                                32. — 4 or — 3.
 6. [illegible]                        33. C.
 7. 32 (or 31).                        34. D.
 8. 17 (or 15-16).                     35. B.
 9. positive (or plus).                36. D.
10. faculties.                         37. B.
11. Spearman.                          38. [illegible]
12. [illegible]                        39. A.
13. specific.                          40. None.
14. g.                                 41. D.
15. group (or common).                 42. E.
16. k, k:m or F.                       43. D.
17. +.                                 44. CF.
18. +.                                 45. A.
19. —.                                 46. CA.
20. +.                                 47. EC.
21. =.                                 48. D.
22. —.                                 49. B.
23. [illegible]                        50. G.
24. +.                                 51. A.
25. —.                                 52. B.
26. C.                                 53. [illegible]
27. [illegible]                        54. C.

(In Nos. 17-25, subtract one mark for each incorrect answer. No
guessing correction need be applied in subsequent questions.)
(In Nos. 42-47 one point is scored by each answer in which one or
both letters are given correctly. Nothing is scored by any answer
which also includes an incorrect letter.)
(For scoring of Nos. 48-54, see p. 242.)
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primary abilities tests, 185 

Tomlinson, T. P., intelligence tests, 
181, 184 

Tulchin, S. H., 164, 264 


Valentine, C. W., 208-9, 251, 264 
intelligence and reasoning tests, 160, 
180, 186 
Vernon, H. M., 187., 264 
Vernon, P. E., 197., 68, 132, 135, 139, 
144n., 147, 151, 157, 161, 180, 181, 
186, 187, 193, 202, 247, 248, 264 
educational tests, 173, 177 
Study of Values, 187 
Vincent, D. F., mechanical test, 185 
Vineland Social Maturity Scale, 187 


tests, 174, 176, 264 
performance test, 181 
spatial test, 184 
Wechsler, D., intelligence tests, 68, 72, 
153, 157, 160, 161, 162, 164, 168 
170, 179, 194; 264 1 
Weiss Long, A. E., 186, 265 
West Riding intelligence test, 184 
West Yorkshire intelligence test, 184 
Whilde, N. C., 1541., 265 
Whipple, G. M., 186, 265 
Wilmer, E. N., 186, 265 
Wilmut, F. S., 209, 260 
Wing, H. D., Music tests, 186, 265 
WISC, see Wechsler 
Wiseman, S., 205, 265 
intelligence test, 184 
interest test, 187 
Wood, B. P., 203, 265 


Yates, F., 85, 261 
Yule, G. U., 107 
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aX discriminable grades of, 31, 34, 
factorial analysis of, 112, Ch. VIII, 


goodness and poorness of, 24-5, 
32-3, 53, 56, 58, 62-5, 67, 150 
measurement by sampling, 20, 87, 
127, 131-2, 152-3, 200-2, 212, 227, 
229, 255 
methods of scoring, Ch. II 
nature of, 19, 130-2 
numerical evaluation of, 18-19, 27, 
28-31, 33-5, 56-65, 149, 203 
qualitative grading of, 11, 26-8, 32, 
34-5, 205, 257 
rate and power plans of scoring, 21, 
22-3, 155 
units in measurement of, 3, 21—5, 30, 
59-60, 65, 70-1, 194 
Accident proneness, 17-18 
Accrediting, 252 
Achievement tests, see Educational 
tests 
Age, allowances, 53, 1957. 
determination of average, 39, 43-4 
factor in correlations, 120-2, 128, 
131, 134 
growth of abilities with, 69-71, 135, 
152, 157n., 209-10 
ranges of tests, 159, 160, 162, 171-2, 
188-9 


tabulation of, 5 
American examinations, 198, 252, 256 
new-type, 211 
American tests, 170, 172 
British performance on, 67, 75, 81, 
149, 159, 160 
construction of, 189-90 
for secondary schools, 155 
prognostic, 250 
purchase of, 170 
Analysis of variance, 85-7 
Analytic schemes of marking, see 
Marking 
Arithmetic ability, group factor of, 
131-2, 126, 137, 138, 143 
measurement of, 19-20, 25, 132, 135, 
207, 211, 255 
new-type examinations of, 232, 240 
range of, 63, 65-6 
tests of, 21, 153, 154, 164, 174, 177-8 
Arithmetic Age, 69 
Attainment tests, see Educational tests 
Attenuation, 129, 134n., 209 
Attitude tests, 19, 187 


Australian tests, 170-2 
university entrance, 249, 252 
Average, see Mean 
Average. Deviation (A.D.), see Mean 
Variation 


Bi-factor analysis, 140-1, 147, 152 
Bimodal distribution, see Frequency 
distributions 


n А 1 
Character traits, factor analysis of, 14 
tests and ratings of, 13, 19-20, 10% 
126, 144, 148, 154, 186-7, 197-8, 
246-8 Д 5 
i idance, use of tests 1n, 39, 145, 
сша, 154, 161, 164—5, 168-9, 192, 
193, 208 ak 
Children, effects of examinations оп, 
199, 246, 250 > 
with language handicaps, testing of, 
158 
young, testing of, 158, 159, 160-1, 
162, 164, 168-9, 191 9. 33 
Children’s attitudes, to marks, 29, 52 
ОА. examinations, 213 
to tests, 158, 161, 164-5, 169, 191 
Chi-squared (7° ), 6n., 88-91, 1 Ju 
Civil Service, selection for, 135, - 


4 
Clerical ability tests, 185 У 
Coarse grouping, effects on correla: 
tions, 98, 101 

on means, 42 , 

on standard deviations, 100 
Coefficient of Association, 107- 
Colligation coefficient, 10 
Colour blindness, tests of, 186 
Column diagram, see Histogram d 
Combination of tests, see Marks а! 


retrial 137, 142n. 
Communality, E 
Comparison of abilities, see Marks and 
Scores 
Compensation theory, 130, 133-4 
Concept formation, 146, 157 10 
Confidence limits, 78, 80, 83,1 
Contingency coefficient (C), 104-7, 
108 
i сан 112, 
Correlation, applications of, 92, » 
"218-29, Chs. VIL УШ, 152-2: 
201, 209, 224, 227, 247, 249 
attenuation of, 129, 1347., 209 
average, 124—5 
between categoric variables, 104-8 


INDEX OF 


Correlation (cont.) 
biserial, 108-9 
combining and contrasting coeffi- 
cients of, 110-11 
effect of, upon reliability formulae, 
82, 84-5 
forecasting efficiency, 112, 118-20, 
123, 209 
from regression lines, 115 
from small populations, 76, 109, 110 
in homogeneous populations, 122-3, 
128, 139, 201, 205, 249 
influence on, of irrelevant factors, 
120-3, 131 
interpretation of, 118-23 
inverse, see negative 
Kendall's method, 104 
multiple, 124-5, 1451. 
паше and objects оѓ, 2, 924, 112, 
negative, 94, 107, 132, 134 
partial, 121-2, 131, 136, 139 
Pearson-Bravais method, 94 
Probable Errors of, see Reliability 
product moment methods (r), 92, 
94-102, 108, 109, 115 
rank order method (р), 26n., 92, 
103-4, 110 
reliability of, q.v. 
r$, 106-7 
Spearman foot-rule method, 104 
tetrachoric (у), 108 
z technique, 110-11 
Correlation Ratio (7), 104-5 
Count scores, 3, 20-5, 28-30, 212, 257 
Critical Ratio (C.R.), 80, 82, 85 
Cross-classification of pupils, 92, 136 
Cumulative frequency graph, 77. 
Cumulative record cards, 163, 187, 
250-1, 252 


Deciles, 37-8, 67 
А of Freedom, 857., 89n., 90n., 


Developmental Quotient, 1577. 
Па Ето но tests, 154, 170, 174, 177, 


Dictation, see Spelling 
Dispersion of measures, 1, 16, 36, 40, 
_ 45, 53-4, 62-3, 65-6, 81, 87, 123-4 
indices of, see Mean Variation, 
Range, Standard Deviation 
Distractions in testing, 191, 192-3 


Drawing ability, tests, 135, 136, 155, 


Educational ability, general factor of, 
137, 139 

Educational Age (E.A.), 69, 70-1, 171, 
188-9, 194-5 
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Educational backwardness, 71, 121, 
158, 161-2 


Educational guidance, 3, 31m. 
144—5, 154, 163, 167, 170, 15466 
198, 248-9, 251 ў 

Educational objectives, 198, 253 

Educabonal Quotient (E.Q.), 71, 157-8, 


Educational tests, 21, 317., 153, 154— 
163, 173-8, 227 ` Y 
choice of, 153, 188-9 
construction of, 153, 189 
correlations between, 134-5 
functions of, 154—5 
norms, 69—73, 154 
reliability of, 126, 127-8, 1347., 150, 
227 
validity of, 150-1, 228, 250 
See also Arithmetic, Reading, etc. 
English ability, factorial analysis of, 
136, 139, 142 
tests of, 75, 153, 154, 157, 176-7, 
234, 242, 250 7, 
See also English composition, Read- 


in 
English composition, marking of, 13, 
21, 25, 27, 28, 29, 31, 34-5, 201-7, 
211, 256 
range of ability in, 65-6 
subjects for, 64, 149, 256 
tests of, 135, 155, 175 
training in, 199, 224 
variations in ability at, 63, 201—2, 205 
Examinations, age allowances in, 53, 
195n. 
as commercial qualifications, 197-8, 
210 
as pedagogical aids, 32, 198 
coaching for, 199, 208, 210, 212, 223, 


250, 252 

criticisms of, 148-51, 154, Ch. XI, 
223, 226, 250, 252, 254 

ease or difficulty of, 24—5, 33, 63-6, 
230-1, 245, 246, 254, 256 

functions of, 154, 197-8, 208, 246, 
253 

improvements in, 253-8 

marking of, 206-7, 220, 227-8, 
256-8; see also Marking of scho- 
lastic work 
new-type, q.v. 

of large numbers of candidates, 53, 
220 

of small numbers of candidates, 35, 
58, 66, 252, 255 

oral, 246-7 

passmarks in, 29, 32, 35, 201, 206, 
212, 258 

reliability of, 2, 128, 200-8, 210, 223, 
221-8, 241, 250, 255-6, 258 
Examinations (cont.) 
setting questions for, 64, 148, 151, 
220, 227, 253-6 
substitutes for, 246-52 
validity of, 2, 87-8, 151, 208-10, 


Examinations, Baccalaureat, 205-6 

Civil Service, 135, 202, 247 

General Certificate, 2, 135, 197, 
198-9, 202, 206, 210, 250, 252 

Leaving Certificate (Scottish), 198, 
209, 210, 250 

Matriculation, 2, 197, 210 

Qualifying (Scottish), 31, 71 

Scholarship, 108, 206, 208-10, 228 

School Certificate, 27, 203-4 

School (internal), 53-66, 197, 252, 


Secondary School selection, q.v. 
Special Place, 203, 208 
Training College, 29-30, 58, 135, 


University, 30, 35, 58, 62-6, 148, 
197-8, 201, 202, 203-4, 209, 


F-ratio, 87 
Factor patterns, 138, 140-2, 144, 146, 
effect of heterogeneity on, 139 
effect of tuition on, 138, 144 
Factorial analysis, Ch. VIII 
limitations of, 138-40, 146-7 
techniques of, 136—7, 139, 143, 147 
uses of, 20, 130, 135, 138, 144–6, 
152, 168—70, 224 
Factors, common, 120, 137, 141 
general, 131, 136, 137n., 139, 140-4, 
group, 131, 136-45, 147, 151, 152, 
161, 169, 195, 226, 250 
multiple, 140—1, 143-4, 146, 147, 152 
specific, 137-8, 140-2, 144, 145, 150, 
151, 161 
Faculties, 20, 130-2, 138, 143, 145, 224, 
53 


Formal discipline, 199, 253 
Free word association tests, 39 
French, tests of, 178 
Frequency distributions, asymmetrical, 
6n., 15, 17, 31, 104 
bimodal, 15 
combining and comparing, 53-62, 
123-4 
curves, area of, 11, 16 
definition of, 4 
dissimilarities in, 54 
graphical representation of, 7-11, 
12-17 


INDEX OF SUBJECTS 


Frequency distributions (cont.) 
Eos 13-15, 18, 24, 29-30, 35, 
90-1, 104 
normal, see Normal distribution 
of classified measures, 5-7 
of correlations, 76 
of differences, 76-80 
of means, NEETA 
rectangular, 26, 
skew, 15, 17-18, 23-4, 29, 30, 31-2, 
39, 54, 57 
statistical methods for handling, 
Ch. III, 54-8 
tabulation ОБ ai E 
Frequency polygon, 7-11, 13- 
Function fluctuation, 128-9, 137n., 200, 
209-10, 227 


g factor, 142-6, 152, 161, 168, 169 


Gaussian distribution, see Normal 
distribution 
Goodness of fit, 50-1, 90-1 
Group intelligence tests, application 
of, 68, 162-3, 164, 167, 190-2, 
batteries, 142, 156 
choice of, 153, 188-9 
classified list of, 153, 181-4 
compared with individual, 162-6, 
167-70 
construction of, 12 155-6, 188-9 
differential, — 
distributions of scores on, o 32, 
50-1, 81, 83-4, 90-1, 188-9 
effects of health and mood on, 
162, 164-5, 169, 191 
effects of practice and coaching on, 
127, 157n., 165-7, 191, 195 
scoring of, 156, 164, 189-90, 5S 
time limits, 132, 156, 163-4, 19 
validity of, 151-3, 167-9, 249 
See also Mental tests 


Handedness tests, 186 

Handwriting ability, distribution of, 17 
factorial analysis of, 136 
marking of, 25, 29, 34 
tests of, 39, 135, DAIRA 122-3 

Heterogeneity of populations, 7 
1 “198, 135, 139, PUR 204n., 205, 209» 


24! 
Histogram, 7-11, 13 
Honours students, superior ability of, 
16, 81-2, 209 


7 49 
Intelligence, age and, 69-71, 94, 2 
definition and nature of, 19, 130, 142, 
146, 152, 156-7, 168-9 В 
economic status and, 68, 70, 74-9, 
121, 158 
Intelligence (cont.) 
educational attainments and, 69, 92, 
108, 122, 125, 139, 144, 146, 153, 
156-8, 164, 168-9, 194, 210, 
249-50 
hereditary and environmental influ- 
ences on, 68, 156-8, 167, 249 
language handicaps and, 158 
nationality and, 68, 158 
physique and, 134 
sex and, 16, 76-80, 157n. 
subjective judgments of, 139, 148-50, 
151-2, 196 
vocation and, 144-5, 146, 157, 195 
Intelligence Quotients (I.Q.), distribu- 
tion of, 15-19, 49-50, 52, 72-3, 195 
interpretation of, 19, 22, 36, 71-3, 
193-6 
prediction of, 112-18 
stability of, 126-7, 156-7, 165, 168, 195 
standard deviation of, 52, 71-3, 
157n., 195 
Intelligence tests, see Group, Non- 
verbal, Terman-Merrill, etc., tests 
Interests, factorial analysis of, 141, 147 
intellectual, 134, 139, 141, 144, 146, 
157, 246 
tests, 187 
Internal consistency, 152-3 
International Examinations enquiry, 
203, 205-6 
Interviews, 246-8 


k or k:m factor, 144, 161 


Language development tests, 174, 176 
Letter grades, 11, 26-8, 205 
Linguistic and literary abilities, see 
English, French, Reading, etc. 


Machine-scoring of tests, 189-90 
Manual abilities, factorial analysis of, 


132, 134, 136, 137, 142, 1 
tests of, 132, 135, 186 22 14 


Marking of scholastic work, analytic, 
28, 203-4, 207, 211-12, 223. 2364 

by ranking, 25-6, 27, 28, 30, 32, 58, 

59, 65, 103, 257 

improvements in, 31-5, 58, 59-62, 
256-8 


objective vs. subjective, 11-12, 20-1, 
28-30, 32-3, 35, 65-6, 149, 155, 202-
3, 206-7, 211-12, 226-8, 255, 257 

standards of, 27-8, 33-5, 53, 58, 
63-5, 86, 150, 200, 202-3, 205, 
251-2 


teachers’ idiosyncrasies in, 6, 27, 29, 
33, 62-4, 202, 257 
See also Ability, Arithmetic, English 

composition, Examinations 


Marks and scores, combination of, 22, 

26, 28, 53-6, 59, 123-5, 204 

comparisons of, 2, 32, 53-4, 56-8, 
59-60, 69 

distributions of, 4-7, 11, 24, 26, 
28-30, 32-5, 48, 53-8, 59-62, 86, 
257 

scaling of, 33, 35, 56-62, 124, 202, 
230, 251-2, 257-8 

significance of, 24-5, 27-30, 33, 36, 

9-40, Ch. IV, 150, 210 

weighting of, 54-5, 66, 123, 125, 244, 

257 


Mathematics tests, 177, 185, 250 
Mean, arbitrary or guessed, 42-5, 
46-8, 85, 95 
Mean, arithmetic, calculation of, 1, 6n., 
38-9, 40-5, 48 
effect of, in combining marks, 54-6 
S.E. of, 82-3 
Mean Variation (M.V.), 45, 48 
Measures, definition of, 3 
Mechanical ability (m-factor), 130-1, 
144, 145 
tests of, 184 
Median, 36-9, 45, 58 
Memory, 130, 132, 142, 143, 145, 168, 
193, 253 


Mental Age (M.A.), 23, 69, 70-1, 156, 


159, 160, 161, 162, 171-2, 188-9, 
194-5 


Mental deterioration, 145-6, 170 

Mental measurement, 19-20, 194; see 
also Ability 

contrasted with physical, 19, 21-4, 

30, 194, 196, 206 

Mental Organisation, 130, 132, 138, 
143, 146-7, 224-5 

Mental speed, 130, 132, 144, 146, 
163-4 


Mental testers, desirable qualities in, 
191-3 


training needed, 73, 153, 159, 160, 
162, 164, 190, 193, 196 
Mental tests, classified list of, 153-4, 
170-87 


construction of, 20, 23, 143, 148-53 

further developments of, 169—70, 250 

instructions, 148, 149, 159, 189-90 

inter-relations between, 130-5; 

Factorial analysis 

d of, 67—73, 126-9, 149, 

151, 152, 159-60, 161—70, Ch. X, 
9. 


50 
25, 26, 28, 40, 55, 58, 66-73, 
79150 160, 161, 167, 171-2, 189, 
190, 191-2, 195 
origins of, 
о, 151, 250 
publishers of, 171 
Mental tests (cont.) 
reliability of, 2, 87, 121n., 126-9, 131, 
137, 150, 164, 165, 168-9, 188, 
194-5 
scoring of, 149-50, 159-60, 164, 
189-90, 191-2 
standardisation of, 25, 67-8, 69—70, 
148-50, 153, 159-60, 195 
validity of, 2, 87-8, 118-19, 1235, 
129, 150-3, 167-9, 188, 249-50 
Multiple factor analysis, 140-1, 143-4, 


Musical abilities, factorial analysis of, 
1 
tests of, 186 


New-type examinations, advantages 
over essay-type, 32, 211-13, 220, 
223-4 

applicability of, 20-1, 35, 66, 219-20, 
228, 239, 243, 255 

best-reason questions, 237-8, 239 

completion questions, 214, 228, 231, 
232-4, 244 

defects of, 151, 211, 219-28, 232-3, 
235-6, 240-1 

distributions of marks on, 29-30, 
32-3, 212, 220-1, 227, 230-1 

elementaristic tendency in, 223-5, 
226, 228, 235, 255 

guessing factor in, 220-2, 232, 235, 
236, 237, 244, 258 

irrelevant clues in, 225-6, 233-4, 
236, 240-1, 244 

matching questions, 218-19, 231, 
238-42, 244 

multiple-choice questions, 215-17, 
220, 222, 225, 228, 231, 234n., 235, 
236-42, 243, 244 

practical, 243 

principles of construction, 32-3, 153, 
212-13, 220, 224-6, 227-8, Ch. 
XIII 

rearrangement questions, 219, 242—3 

reliability of, 210, 221-2, 226-8 

scoring of, 220, 234, 242-3, 244 

simple recall questions, 212, 213-14, 
221, 226, 231, 232-4, 239-40, 244 

specimen paper, 213-9, 258 

suggestive effect of false statements 
in, 222 

true-false questions, 214-15, 220-2, 
228, 231, 234-6, 244 

validity of, 151, 225, 226, 228, 245 

weighting of marks, 244 

Non-verbal intelligence tests, classified 

list of, 153, 181-2 
factor analysis of, 144 
uses of, 158, 163, 250 
See also Performance tests 
Normal curve of error, see Normal 
distribution 
Normal distribution, algebraic equa- 
tion of, 49 
applicability to reliability, 76-80 
application to test construction, 23-4, 
31, 65, 188 
applications to school marking, 


17-18, 24, 26, 31-5, 55, 59-62, 205, 
257 
definition of, 17 
disturbing factors in, 17-18 
examples of, 17 
ordinates of, 108 
testing conformity to, 50-1, 90-1 
Norms, see Mental tests 
Null hypothesis, 88, 89, 105 
Numerical description of distributions, 
36-7, 39-40, 49, 56 
Numerical evaluation of abilities, see 
Ability 


Ogive graph, 7n. 

Omnibus tests, 156, 190 

One- and two-tail significance tests, 
n. 

Oral group tests, 163, 181 


Passmarks, see Examinations 
Pearson-Bravais method, see Correla- 
tion 
Percentage marking, defects of, 28-31; 
see also Ability, numerical evalua- 
tion of 
Percentages, S.E. of, 84, 86, 90 
Percentile graphs, 56-61 
Percentiles, 36-7, 38-40, 49, 58, 156, 
171, 189, 194 
comparison of distributions by, 
56-61, 65 
sizes of units, 59, 67 
Performance tests, application of, 156, 
158, 160-2, 169, 180-1, 189, 193 
classified list of, 153, 180-1 
construction of, 148-50 
factor analysis of, 144 
improvement on, with age, 70 
limitations of, 126, 161 
scoring of, 21, 149, 155, 161, 189 
Perseveration tests, 104 61 
Personality, study of, 141, 147, 1 Od 
169, 193-4, 246, 248; see also 
Character, Temperament 
tests, 154, 186-7 
Physical abilities, distribution of, 17 
measurement of, 23 
tests of, 132, 186 
Political opinions, distribution of, 


scaling of, 19 
Practical ability, 134, 136, 144, 151, 
161, 168, 193 
factor (F), 144n., 161 
tests of, 187 
Practical examinations in science, 243 
Practical teaching ability, 136, 146, 
249 


Practice, effects of, on intelligence 
tests, 127, 157n., 165-7, 191, 195 

Practice sheets, 156, 189, 191 

Probability integral tables and graph, 
49-52, 59, 60-2, 77, 79 

Probability theory, 51-2, 63-4, 77-80, 

09, 221 

Probable Error (P.E.), 78-80, 87n., 
109, 204; see also Reliability 

Product-moment method, see 
Correlation 

Product scales, see Quality scales 

Projective techniques, 187 

Pupils’ records, see Cumulative record 
cards 


Quality scales, 34, 149n., 154-5, 175, 
178, 254 

Quartiles (Q1 and Q3), 36-8, 40, 54, 56, 
58, 79 

Quota scheme, 251 


Range of measures, 1, 6, 36, 40, 45, 48, 
51, 55, 66; see also Dispersion 

Ranking of measures, 4, 103-4, 195; 
see also Marking 

Rank-order method, see Correlation 

Rapport in testing, 165, 191-3 

Rate and power plans, see Ability 

Reading ability, factorial analysis of, 


marking of, 29, 56 
tests of, 21, 22, 75, 135, 151, 153, 
173-5 
Reading Age, 69 
Reasoning abilities, 143, 146, 157-8, 
223-5, 253 
tests of, 179, 186 
Recognition type items, 144, 167, 169, 
212, 222, 225-6, 232, 234-43 
Regression, 113-16, 123, 133 
equations, 115-16 
multiple, 125 
non-linear, 104—5 
Reliability, coefficients of, 126, 129 
effects of function fluctuation, 128-9 
effects of heterogeneity, 128 
effects of length of test, 126, 127-8 
formulae for, 81-5, 109-10, 117, 
126-8 


of correlations, 74, 76, 81, 87, 
Reliability (cont.) 
of differences, 2, 74-5, 76-82, 84-5, 
86 


of means, 74-5, 81, 82-3, 84 


of percentages, 84, 86, 90 

of predicted scores, 87-8, 117-19, 
123, 126-7, 157 

of scores, 87, 126—7, 194 

of small samples, 85 

of standard deviations, 83-4 

of z, 110 

theory of, 74-80 

See also Examinations, Mental tests 


Sample test items, 156, 189, 191, 244 
Samples of the population, unselected, 
17, 67-70, 74-5, 80, 122-3 
Sampling errors, 63, Ch. V, 105, 
109-10, 126-7, 128-9, 137n., 138, 
142, 195, 200-1 
Sampling of ability, see Ability 
Scaling, see Marks and scores 
Scattergram, 92-4, 98, 101, 104, 105, 
112-4, 117, 133-4 
Scatter of measures, see Dispersion 
Scholastic abilities, factorial analysis 
of, 92, 134-6, 142, 144 
tests of, see Educational, Arithmetic, 
etc., tests 
Scientific approach to psychological 
problems, 3, 74, 76, 78, 86, 130 
Scotland, test norms in, 68 
Secondary education, selection for, 
31, 71, 92, 119-20, 122, 124-5, 
154, 163, 164, 165-7, 195n., 197, 
198, 199, 205, 209, 223, 224, 228 
Semi-interquartile range (Q), 40, 45, 
57 
Sensory-motor tests, 186 
Sex differences in ability, 2, 15-16, 75, 
76-81, 83-4, 157n. 
in cinema attendance, 89-90 
in milk-drinking, 84 
Simple structure, 141 
Skew distribution, see Frequency 
distribution 
Small sample techniques, 85 
Spatial ability (k-factor), 143, 144 
1 > 


tests of, 184 
Spearman-Brown prophecy formula, 
а TR 168. 16 230, 132, 144, 
Spelling ability, factorial analysis of, 


measurement of, 22, 25, 26 
tests of, 21, 135, 175, 234 


Spread of measures, see Dispersion 




Standard Deviation (S.D. or σ), 36, 40, 
45-52, 87 
differences between, 83 
of averaged measures, 123-4 
use of, in combining marks, 54-6, 
59-62, 65 
Standard Error (S.E.), 78, 87n., 88; 
see also Reliability 
Standardisation, see Mental tests 
Standard scores, 55-6, 59-62, 65, 67, 
69, 73, 115, 123, 156, 171, 194 
Standards of marking, see Marking of 
scholastic work 
Statistical methods, definition of, 1, 2-3 
limitations of, 2-3, 80, 138-9, 146-7 
objects of, 1-3 
Statistical significance, 79-80; see also 
Reliability 


t, 85, 87, 110 
Tabulated measures, average of, 40-5 
correlation between, 98-102 
percentiles of, 38 
standard deviation of, 47-8 
Tabulation of measures, 1, 3-7 
Television viewing, 105-7 
Temperament tests, 19, 126, 161, 186—7 
factor analysis of, 141, 144, 146 
Tetrad difference criterion, 142 
Time limits in testing, 20-1, 132, 156, 
163-4, 171, 190 
Training College students, abilities of, 
39-40, 75-6, 80, 86, 135, 136, 139, 
155, 249 
Trustworthiness of statistical results, 
see Reliability 
Two-factor theory, 141-5 




Unitary factor patterns, 141-2, 147, 
152 


United States, see American 

University education, selection of 
students for, 2, 122, 135, Ve 198, 
208-10, 228, 247, 248-9, 2. 


Validity, see Examinations, Mental tests 
Variability, see Dispersion, ES 5 129, 
of artament aye 17, 62-3, 

131, 135, 2002 . Ў, 5 
Variables, categoric, 88-90, 104-7, 10 
definition of, 3. 3, 5, 50 

discrete and continuous, 5, > 
traits and abilities as, 19 
Variance, 87 143 
Verbal ability (v or v:ed factor), 142, 
ТРАЕ 193; see also 
English ability 
Verbal tests, see Group intelligence 
tests 
Viva voce, see Examinations, oral 
Vocabulary, see Language develop- 
ment, Stanford-Binet, etc. 
Vocational abilities, factor analysis of, 
136-7, 145 163, 
Vocational guidance, 144-5, 154, 
170, 248, 251 20, 125, 135, 
Vocational selection, s , 125, 
145m. 161, 210, ЗУ 8719, 125, 126, 
tional tests, 55, 5 
Varia 154, 184-6, 189 


War Office Selection Boards, 248 
Weighting, see Marks and scores 


X-factor, 144 
z-techniques, 110-11 




PERSONNEL 
SELECTION IN THE 
BRITISH FORCES 


PHILIP E. VERNON, M.A., Ph.D. 
and 
JOHN B. PARRY, B.A., Ph.D. 


This book gives a full account of 
the application of psychological 
methods of personnel selection to 
Navy, Army, Air Force and 
A.T.S. (now W.R.A.C.) recruits 
during the war. It also covers 
officer selection by the well- 
known War Office Selection 
Boards. The value of each of the 
techniques employed - the ques- 
tionnaire, the interview, paper- 
and-pencil and other types of 
aptitude, attainment, and tem- 
perament tests - is discussed in 
the light of pre-war investigations 
and war-time experience; but 
technical statistical details are 
kept to a minimum. The chief aim 
of the book is to show what 
general principles of vocational 
selection and guidance follow 
from this vast experiment in 
applied psychology, and how they 
may be adapted to educational 
and industrial purposes today. 
THE MEASUREMENT 
OF ABILITIES 


New and Revised Edition 


Primarily written for teachers, clinic workers, school 
medical officers and psychology students, this book 
provides an elementary interpretation of the statistical 
techniques which are essential to a proper under- 
standing of mental measurement and shows the 
application of these techniques to problems of testing 
and measurement in English and En schools and 
universities. 


This new edition brings the book up-to-date in respect 
of selection for secondary education and presents, in 
the light of developments in the field of intelligence 
theory and testing, a more reasonable picture of the 
relations of innate and acquired abilities and of the 
findings of factorial analysis than has hitherto been 
available to education and psychology students. The 
[...] completely reconstructed and several of the statistical 
sections have been revised and expanded. 


